As a result of giving one stimulus to an experimental unit, what we obtain is not just one
response but several responses. In statistical language, we deal with a multivariate
situation as opposed to univariate situations. Usually, many stimuli, called factors, are
considered at many levels in the same experiment. Many statistical techniques are
available to analyse this type of data and to draw conclusions therefrom. The present work
considers one such technique, the identification of subgroups of individuals on the basis
of responses, i.e., a special case of cluster analysis. Many different algorithms
proposed for detecting clusters have been reviewed. These fall into two classes: (i) those
which detect clusters of variables (factor analysis) and (ii) those which detect clusters
of experimental units (cluster analysis). What we have done in the present work lies in
between these two techniques. We first perform cluster analysis on p responses for each
unit, and sort the experimental units into groups. After assigning experimental units to
groups, we look for those individuals whose total response could be expressed in (p-1)
combinations of original responses, irrespective of the groups or clusters to which they
belong. Those individuals whose total response could be described in terms of (p-1)
combinations of original responses are said to lie on a "simple structure plane." A geometric
probability approach has been used to determine whether a given point lies on a simple
structure plane. The orthogonal distance of a point from the plane is used as a criterion.

If many points lie on a given simple structure plane, the plane is said to be
overdetermined. A probability expression has been derived to determine whether a simple
structure plane can be regarded as being overdetermined. Those individuals which lie on a
(p-1)-dimensional hyperplane in a p-dimensional space need one variable less in their
description. In this case, (p-1) linear combinations of the original p variables could be
used for describing these points. For real-life interpretation, this procedure would be
useless, as we would be left with (p-1) artificial variables instead of p observable
variables.

It would seem reasonable, then, to eliminate, for the data points close to one of the
subspaces, that observable variable which contributes the least to the description of
this selection of data points. A correlation approach has been used to determine the
variable to be eliminated, and it turns out that the variable which correlates most strongly
in absolute value with the artificial variable should be discarded.

Two computer programs have been developed to assist the user in analysing his data
using this approach. The first one, named CLUSTR, identifies clusters and simple structure
planes. The second one, an interactive graphics program named ELLIPSE, can be used to
visualize the configuration of clusters and the underlying simple structures. The
clusters are projected onto many 2-dimensional spaces and displayed on the IBM 2250
Graphics terminal. If the plane of projection selected is orthogonal to a simple structure
plane, the points lying on the particular simple structure plane will make a band
of narrow width more or less resembling a straight line. For other planes of projection
this will not hold. The clustering of points is, however, unaffected by the choice of
a particular plane of projection.

CHAPTER I


INTRODUCTION 

1.1  Multivariate  Experiments 

Many experiments involve measuring a number of response variables
simultaneously. Thus, for example, in determining the effectiveness of a
new drug, a person's systolic and diastolic blood pressure may be observed
before and after administration of the drug; also his pulse rate, temperature
and other physiological data may be recorded. As a result of
giving one stimulus to an experimental unit, what we obtain is not just
one response but several responses. In statistical language, we deal with
a multivariate situation (many responses) as opposed to univariate
situations (only one response). An experiment is rarely so simple as
described above. Usually, many stimuli, which will be called factors, are
considered at many levels in the same experiment. Ultimately, the
experimenter has a large collection of data before him. The problem is how to
interpret the data. Depending upon the experimenter's objective, many
statistical techniques are available to analyse the data and to draw
conclusions therefrom. In the present work, we consider one such technique,
the identification of subgroups of individuals on the basis of responses,
i.e., a special case of cluster analysis. In the following sections, we
review some of the problems of the subject and then outline the special
problem in the present dissertation.

1.2 Review of the related literature

Consider an experiment where many responses are recorded on each
experimental unit. We shall assume that there are n experimental units,
and, on each unit, p responses are measured. The resulting data can be
put in the form of an n x p matrix. Each row of this matrix corresponds
to one experimental unit. Each experimental unit can be represented by a
point in a p-dimensional space. It is possible that some of these points
will be so close to one another that they form a "cluster." The problem
of detecting clusters has been considered by many authors during the last
30 years. Tryon [37] in 1939 gave many algorithms, based on the correlation
matrix of variables, for the related problem of assigning variables to
groups. The technique, very similar to the concept of the "coefficient of
belonging" described by Harman [14], was developed on the assumption that
correlations among variables belonging to the same group should be much
higher than correlations between these variables and those not belonging
to the group. Holzinger and Harman [15] defined their coefficient of
belonging, or B-coefficient, as "100 times the ratio of the average of the
intercorrelations among the variables of a group to their average
correlation with all the remaining variables."
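
The definition of the B-coefficient quoted above can be illustrated with a short Python sketch. The function name `b_coefficient`, the list-of-lists layout of the correlation matrix, and the 4-variable matrix `R` are assumptions introduced here for illustration only; they are not taken from Holzinger and Harman.

```python
def b_coefficient(corr, group):
    """100 * (average intercorrelation within the group)
           / (average correlation of group members with all remaining variables).

    Assumes the group has at least two variables and that at least one
    variable lies outside the group (otherwise an average is undefined).
    """
    group = list(group)
    others = [j for j in range(len(corr)) if j not in group]
    # each unordered pair of group members is counted once
    within = [corr[i][j] for a, i in enumerate(group) for j in group[a + 1:]]
    # every (group member, outside variable) pair
    between = [corr[i][j] for i in group for j in others]
    return 100.0 * (sum(within) / len(within)) / (sum(between) / len(between))

# Hypothetical correlation matrix: variables 0,1 and 2,3 form two tight groups.
R = [[1.0, 0.8, 0.2, 0.1],
     [0.8, 1.0, 0.3, 0.2],
     [0.2, 0.3, 1.0, 0.7],
     [0.1, 0.2, 0.7, 1.0]]

b_first = b_coefficient(R, [0, 1])
b_second = b_coefficient(R, [2, 3])
```

A value well above 100 indicates that the variables in the group correlate more strongly with each other than with the remaining variables, which is the assumption on which the grouping technique rests.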

Sokal and Michener proposed the weighted mean-pair method to
identify clusters. This method was originally applied to an entomological
problem [33]. A more recent description of this method has been given by
Sokal and Sneath [34], who recommend it as "the best of a class of commonly
used methods of cluster analysis." The method operates on an N x N
similarity matrix. Those two individuals, i and j say, which have the
highest similarity are paired together, i.e., put in the same cluster.
Any appropriate measure of similarity, e.g., product moment correlation
coefficient, coefficient of association, etc., can be used, although by
far the most commonly used measure is the product moment correlation
coefficient. After the individuals i and j have been paired together,
columns (and rows) i and j of the similarity matrix are replaced by a
single column (and row) consisting of the means of the elements in columns
(and rows) i and j. The process is then repeated on the new matrix of order
(N-1), when either two new individuals have the highest similarity and form a
new pair, or the existing pair combines with a further individual to make
a cluster of three. The process continues, and at each stage either a new
cluster consisting of a pair of individuals is formed or a new individual
is assigned to one of the already existing clusters of previously
combined individuals.
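
The merging procedure just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' program: the function name `weighted_pair_group`, the rule of merging until a single cluster remains, and the small similarity matrix are all hypothetical. As in the description above, the merged row is the plain mean of the two rows it replaces.

```python
def weighted_pair_group(sim, labels):
    """Repeatedly merge the two most similar clusters.

    Returns the merge history as (cluster_a, cluster_b, similarity) triples.
    Assumption: merging continues until one cluster is left; the text
    does not state a stopping rule.
    """
    sim = [row[:] for row in sim]               # work on a copy
    clusters = [(lab,) for lab in labels]
    merges = []
    while len(clusters) > 1:
        n = len(clusters)
        # highest off-diagonal similarity and its position
        best, bi, bj = max((sim[i][j], i, j)
                           for i in range(n) for j in range(i + 1, n))
        keep = [k for k in range(n) if k not in (bi, bj)]
        # means of the elements in rows bi and bj
        merged = [(sim[bi][k] + sim[bj][k]) / 2.0 for k in keep]
        new_sim = [[sim[a][b] for b in keep] + [merged[x]]
                   for x, a in enumerate(keep)]
        new_sim.append(merged + [1.0])          # matrix of order one less
        merges.append((clusters[bi], clusters[bj], best))
        clusters = [clusters[k] for k in keep] + [clusters[bi] + clusters[bj]]
        sim = new_sim
    return merges

# Hypothetical similarities among four individuals A, B, C, D.
S = [[1.0, 0.9, 0.2, 0.1],
     [0.9, 1.0, 0.3, 0.2],
     [0.2, 0.3, 1.0, 0.8],
     [0.1, 0.2, 0.8, 1.0]]
history = weighted_pair_group(S, ["A", "B", "C", "D"])
```

With this matrix, A and B are paired first, then C and D, and the two pairs are joined last at a much lower similarity; in practice one would stop merging once the best remaining similarity falls below a chosen level.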

Edwards and Cavalli-Sforza [?] suggest dividing the points into
two sets such that the sum of squares of distances between sets is a maximum.
Thus, according to this method, one can find only two clusters, no
more and no less. Since the total sum of squares is a constant for a
given sample, maximizing the between-groups sum of squares is equivalent to
minimizing the within-groups sum of squares. The method consists of
examining all the $2^{N-1} - 1$ two-set partitions of N individuals and
selecting the one which gives the minimum within-set sum of squares. The
method is not suitable for a large value of N, as the time required on a
computer to examine all the two-set partitions is enormous. It was estimated
that with N = 21, the time required to examine all the partitions
on a computer with 5 micro-second access time would be 100 hours and that
for N = 41, it would be 54,000 years. Thus, even with the help of the
fastest computers, the method would be impracticable. One more algorithm
based on ecological applications was proposed by Williams and Lambert
[3?]. Gower [12] gives an excellent comparison of the last three mentioned
algorithms together with some of his own modifications proposed
for these algorithms.
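
To make the cost of the exhaustive two-set search described above concrete, the following Python sketch enumerates every two-set partition of a small sample and keeps the one with the smallest within-set sum of squares. The function names and the five data points are assumptions for illustration; the sketch is a brute-force reading of the description, not the published algorithm.

```python
from itertools import combinations

def within_ss(points):
    """Sum of squared deviations of the points from their own centroid."""
    n, dim = len(points), len(points[0])
    centroid = [sum(p[d] for p in points) / n for d in range(dim)]
    return sum((p[d] - centroid[d]) ** 2 for p in points for d in range(dim))

def best_two_set_partition(points):
    """Examine every two-set partition; return (min SS, set 1, set 2, count)."""
    n = len(points)
    rest = range(1, n)
    best, count = None, 0
    # Point 0 always stays in the first set, so each partition is seen once.
    for size in range(n - 1):             # extra members of set 1: 0 .. n-2
        for extra in combinations(rest, size):
            first = (0,) + extra
            second = tuple(i for i in rest if i not in extra)
            count += 1
            ss = (within_ss([points[i] for i in first])
                  + within_ss([points[i] for i in second]))
            if best is None or ss < best[0]:
                best = (ss, first, second)
    return best + (count,)

# Hypothetical sample: three points near the origin, two far away.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
ss, set_1, set_2, examined = best_two_set_partition(data)
```

The counter `examined` equals 2^(N-1) - 1, so every additional individual doubles the work, which is why the method becomes impracticable for even moderately large N.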

Another important study was made by Neyman and Scott [27]. They
extensively  studied  the  clustering  of  galaxies  in  the  universe  and  pre¬ 
sented  the  theories  of  "simple  clustering"  and  "multiple  clustering." 
"Simple  clustering"  was  based  on  the  assumption  that  galaxies  occur  in 
clusters  and  that  the  cluster  centers  are  uniformly  distributed  through¬ 
out  the  universe.  In  "multiple  clustering"  it  was  assumed  that  the 
cluster  centers  radiate  from  super  clusters. 

From  the  above  discussion,  it  is  clear  that  in  the  statistical 
literature, we come across two types of clusters: (i) the clusters of
variables and (ii) the clusters of individuals or points or experimental
units.  Without  going  into  the  details  of  the  confusion  that  these  two 
concepts  have  created  and  their  uses  and  misuses,  we  only  note  that  given 
a  multivariate  sample,  it  is  possible  to  identify  the  underlying  clusters 
by  applying  any  of  the  suitable  techniques  available. 

We  now  describe  in  detail  an  algorithm  proposed  by  Bargmann  and 
Graney  [5]  to  determine  clusters  with  the  object  of  identifying  mixed 
samples  of  multivariate  normal  distributions.  With  the  help  of  methods 
to  be  developed  in  the  present  work,  we  may  then  study  the  configuration 
of such subgroups.

1.3 Real and Virtual Clusters

Real clusters are defined to be clusters of points in the original
space. Virtual clusters, on the other hand, are clusters of projections of
points in a space of dimension lower than that of the original space. To
borrow an example from Graney [10], if one were to look for clusters of
stars as observed from the earth, one would be dealing with virtual
clustering. The observer perceives the stars as projections onto the
surface of the celestial sphere. To determine the real clusters of stars,
it would be necessary to measure the distance of each star from the
observer. It is also obvious from the above example that virtual clusters
may not necessarily be real clusters, and vice versa. One may tend to
think that real clusters would necessarily be virtual clusters also. This,
however, is not so. If one were standing in the midst of a real cluster,
one might not find any cluster at all. In this connection, it should also
be noted that in the algorithm proposed by Graney [10], if a cluster is
centered at the center of the entire system, it would not be detected.
This is similar to the identification of galaxies. Being members of our own
galaxy, we do not obtain, by direct observation, a description of the
configuration of our galaxy. For that purpose, we will have to make
calculations based upon distance measurements (or observe from a different
galaxy).

1.4  d-clusters  and  k-clusters 

As experimental units are assigned to clusters, a decision has
to be made whether two units are close enough to justify their inclusion
in the same cluster. One may look at this problem from two directions: the
so-called d-clusters and k-clusters. Consider a region S of fixed radius
d. This is said to be overdetermined at the $\alpha$-level of significance if
the number of points in the region exceeds a value $k_0$ such that, under
the null hypothesis of uniform distribution of points,

    $P[k_0 \text{ or more points in } S] < \alpha$

This type of cluster is known as a d-cluster since it results from a region
of fixed radius d.
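
As an illustration of the d-cluster criterion, the critical count can be computed once a null distribution for the number of points in S is fixed. The sketch below assumes, beyond what the text states, that the n points fall independently and that each lands in S with probability q (the share of the total volume occupied by S), so that the count is binomial. The function names and the numerical values are hypothetical.

```python
from math import comb

def binomial_upper_tail(n, q, k):
    """P[k or more of n independent points fall in S], each with probability q."""
    return sum(comb(n, j) * q ** j * (1 - q) ** (n - j) for j in range(k, n + 1))

def critical_count(n, q, alpha):
    """Smallest k0 with P[k0 or more points in S] < alpha, or None if none exists."""
    for k in range(n + 1):
        if binomial_upper_tail(n, q, k) < alpha:
            return k
    return None

# Hypothetical setting: 100 points, S covers 5% of the volume, alpha = 0.01.
k0 = critical_count(100, 0.05, 0.01)
```

A region holding `k0` or more points would then be declared overdetermined at the chosen level; a different null model (for example a Poisson count) would change `binomial_upper_tail` but not the search for the critical value.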

The other type of clustering results when a fixed number of points
fall into a region with sufficiently small radius. Let there be a fixed
number of points, say, k. Let $d_0$ be the radius of a sphere (or hypersphere)
just sufficient to enclose all these k points within the sphere.
It is apparent that d, the radius of the sphere, is a random variable. If

    $P[d < d_0] < \alpha$

where $\alpha$ is the predetermined level of significance, then these k points
are said to form an overdetermined cluster. Such a cluster is known as
a k-cluster since it results from a fixed number of points falling within
a sufficiently small sphere.

1.5 Identification of Mixed Samples

Consider  the  case  of  k-cluster  discussed  above.  As  an  example  of 
this  type  of  cluster,  consider  an  experiment  in  which  the  number  of  people 
given  a  drug  from  a  certain  class  of  drugs  (which  produce  similar  effects) 
is  limited.  Then  one  would  like  to  know  if  the  symptoms  among  some  indi¬ 
viduals  are  more  closely  alike  than  those  of  other  individuals.  In  other 
words, it would be appropriate to see if there are some individuals whose
symptoms  are  so  close  that  they  form  a  cluster.  But  this  also  implies  that 
we  are  looking  for  a  principle  of  classification  which  distinguishes  this 
group  of  individuals.  Such  a  situation  can  arise  in  linear  analysis.  For 
the  sake  of  simplicity,  we  shall  describe  the  situation  in  terms  of  uni¬ 
variate  analysis,  but  it  applies  equally  well  to  multivariate  analysis. 

In a two-factor experiment, with one factor applied at r levels and the other
applied at s levels, we will have r x s cells; let us assume that there
are $n_{ij}$ observations in cell (i, j). Analysis of these data on the basis
of a linear statistical model assumes homogeneity of cell variances. The
hypothesis of the equality of cell variances can be tested by application
of Bartlett's test. If this hypothesis should be rejected, three possible
causes can be responsible: (i) The cell variances are not constant, $\sigma^2$,
but proportional to some known $v_{ij}$ (a different one for every cell). A
variance-stabilization transformation, or weighted regression, or both,
can be employed to correct this situation. (ii) There may be "mavericks,"
i.e., misclassified or improperly recorded items. They can be omitted before
the analysis of the data. (iii) There may be a third factor of classification
present. If this is the case, what we regard as the "error sum of
squares" is in fact not the error sum of squares but the variance-component
sum of squares: error, plus the sum of squares due to the third factor,
which we have not taken into account. In multivariate analysis of variance,
this quantity would be the (H + E) matrix instead of the E matrix alone,
where E stands for the "error" SSP matrix and H stands for the "hypothesis"
SSP matrix.* Hence the problem reduces to that of detecting the third
hidden factor. It is clear that the sample within each cell comes from
two or more populations instead of from just one. It will be necessary to
"unmix" the samples within each cell before a valid linear statistical
analysis of the data can be performed. Instead of describing the technique
of unmixing the mixed samples, we refer to Graney [10] for a full
description of the technique, which also contains a number of illustrations.

*SSP: sum of squares and products, sometimes also called the "Wishart"
matrix.

There has been, in recent years, a considerable resurgence of
interest in cluster analysis. Ling [23] has discussed many techniques
and has tried to classify them. It is apparent that there is little
purpose in inventing yet more similarity indices, distance measures, or
search algorithms. The present dissertation deals with a point intermediate
to the two problems which cluster analysis has attempted: (a)
classification of individuals on the basis of responses and (b) classification
of response variables assuming a homogeneous group of individuals
(really factor analysis). Cattell [6] and Stephenson [35] view the entire
complex as "factor analysis" and call the first problem the "Q technique,"
the second problem the "R technique," and the combined problem the "P
technique." Unfortunately, their techniques do not lend themselves to a
study of configurations because, in their attempt to regard every problem
as a correlational one, the authors perform analyses which become
self-contradictory. For example, sums of squares and products can be used as
terms in estimates of correlation between variables only if the individuals
are independent, and conversely, "correlations" between individuals can be
estimated by a product-moment approach only if the variables are uncorrelated.
Quadratic forms would be needed otherwise, and the sums of squares
and products can be quite meaningless. The present study avoids this
confusion by separating, at each stage, the clustering problem (clusters
of individuals) from the problem of structures of variables within
clusters.

Guttman [11] has proposed yet another technique to reduce the
dimensionality of data. The technique, "nonmetric" in nature, works on an
n x n symmetric matrix R, and determines those transformations which yield
a Euclidean coordinate system X (X : n x m), such that XX' = F for m
a minimum and, furthermore, satisfy all inequalities that whenever
$r_{ij} > r_{kl}$ then $f_{ij} > f_{kl}$ for the non-diagonal elements of R and F
($i \neq j$, $k \neq l$). This avoids the problem of communalities and "when some
lawful structure or pattern is present in the data, e.g., a simplex, a
circumplex, or a radex, a nonmetric analysis will reveal the configuration,
whereas a metric approach will obscure the lawfulness." For detailed
discussion of this approach and the algorithms developed in this connection,
refer to Guttman [11] and Lingoes and Guttman [24]. It is clear that this
technique will reduce the dimensionality of all the data points. Again,
as in factor analysis, it will not be affected by mean shifts. It is thus
an alternative to structural and factor analyses and not an "in-between"
solution as proposed by us in section 1.6.

1.6 Definition of the Problem

In  the  above  sections,  we  have  given  a  brief  outline  of  traditional 
approaches  to  cluster  analysis.  The  discussion  reveals  that,  given  a 
multivariate  sample,  it  is  always  possible  to  detect  some  underlying 
cluster  structure  and  to  assign  points  (experimental  units)  to  the  clusters 
to  which  they  belong.  This  is  not  the  only  kind  of  analysis  that  can  be 
performed.  The  same  data  could  also  be  subjected  to  factor  analysis, 
which  assigns  response  variables  to  classes.  Cluster  analysis  will  group 
the  individuals  bringing  out  the  mean  effects  and  leaving  the  variable 
structure  unaltered.  Factor  analysis  will  describe  the  data  in  terms  of 
artificial variables without telling anything about the group means
involved. What we propose to do in the present work lies in between these
two extremes. We first perform cluster analysis on p responses for
each unit, and sort the experimental units into groups. After assigning
experimental units to groups, we look for those individuals whose total
response could be expressed in (p-1) combinations of original responses,
irrespective of the groups or clusters to which they belong. Thus the
ultimate purpose of our analysis is to elicit more information from a set
of multivariate data than is possible by cluster or factor analysis alone.
Frequently both of these extremes produce trivial results. In medical
applications, cluster analysis tends to produce clusters of patients who
are healthy, slightly ill, and very ill. By contrast, factor analysis
identifies a collective set of symptoms such as "fever," "pain," and
"chills."


The  algorithm  developed  in  the  present  work  begins  by  identifying 
clusters  in  the  sense  of  estimating  parameters  of  mixed  multivariate 
normal  distributions  with  equal  variance-covariance  matrices.  If  this 
were  not  done  the  unit  ellipsoids  around  the  grand  mean  would  be  affected, 
uncontrollably,  by  mean  shifts.  Within  each  of  these  ellipsoids,  we  look 
for  well  overdetermined  subspaces  of  lower  dimensionality,  in  the  sense 
that  we  want  a  very  significant  number  of  individuals  to  fall  into  a  region 
close  to  these  highly  overdetermined  subspaces.  These  will  be  called 
"Simple  Structure  Planes." 

We  use  this  term  because  of  its  similarity  with  one  criterion 
which  Thurstone  [36]  used  in  describing  a  "simple  structure"  in  the  common- 
factor  space  of  factor  analysis.  If  all  principal  components  of  the 
within-cluster matrix were calculated and "rotated," the results of our
study could also be produced. This, however, would be a tremendously
wasteful procedure.

Those individuals which have a large distance from the subspace,
yet belong to the original cluster, would differ from the others in that
they require at least one additional diagnostic.

Geometrically, the Euclidean distance on a metric given by the
unit ellipsoid is obtained as follows: A vector is passed from the center
of the ellipsoid (O) to the point in question (A). This vector intercepts
the unit ellipsoid at point (S). The Euclidean distance is then
(length OA)/(length OS). For this reason, the ellipsoid is called the "unit
ellipsoid." It generalizes the concept of a "unit interval" in one dimension
(hence metric). The computer output reports these distances as a
vector v.
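
The ratio (length OA)/(length OS) can be checked numerically in two dimensions. The sketch below assumes that the unit ellipsoid has the form (x - c)' M^{-1} (x - c) = 1 for a positive definite matrix M and a center c; under that assumption the ratio equals the square root of the quadratic form evaluated at A, a quantity commonly called the Mahalanobis distance. The function names and the numbers are illustrative only.

```python
from math import sqrt

def inverse_2x2(m):
    """Inverse of a nonsingular 2 x 2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def quadratic_form(v, m):
    """v' m v for a 2-vector v and a 2 x 2 matrix m."""
    return sum(v[i] * m[i][j] * v[j] for i in range(2) for j in range(2))

def ellipsoid_distance(point, center, matrix):
    """Return (OA/OS, sqrt of the quadratic form); the two should agree.

    Assumes the point differs from the center.
    """
    diff = [point[0] - center[0], point[1] - center[1]]
    q = quadratic_form(diff, inverse_2x2(matrix))
    # S = center + t * diff lies on the unit ellipsoid when t*t*q == 1
    t = 1.0 / sqrt(q)
    length_oa = sqrt(diff[0] ** 2 + diff[1] ** 2)
    length_os = t * length_oa
    return length_oa / length_os, sqrt(q)

# Hypothetical ellipsoid and point.
ratio, root_q = ellipsoid_distance((3.0, 1.0), (1.0, 0.0),
                                   [[4.0, 1.0], [1.0, 2.0]])
```

A point on the unit ellipsoid has distance 1, a point inside it has distance below 1, and a point outside it has distance above 1, which is the sense in which the ellipsoid acts as a unit of measurement.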

Now, the largest projected distance of a point, onto a 2-dimensional
subspace, from the simple structure unit ellipsoid, is at most equal to the
actual  distance  in  p  dimensions.  Hence  a  point  showing  an  appreciable 
distance  from  the  simple  structure  ellipsoid,  on  any  projection,  would  be 
representative  of  a  point  requiring  one  variable  more,  for  adequate  des¬ 
cription,  than  the  points  lying  in  the  simple  structure  region.  This 
property  is  not  related  to  the  clustering  of  the  points.  Points  in  the 
same  simple  structure  subspace  may  be  far  removed  from  each  other;  they 
may  belong  to  different  clusters. 

The points having a large (positive or negative) distance from
each of these planes can now be scrutinized. They share some characteristics
which make them different from the other points on this plane.
It is to be noted here that this procedure was not intended to find
subclusters (one could do that by tightening the control constants in the
original program) but to study subspaces of lower dimensionality. Thus
the multivariate data are viewed from a new point of reference, and one
may identify principles permitting a different taxonomy of experimental
units (patients, plants, etc.).

There is a certain analogy of techniques between the subspace
solution and the "simple structure rotation" in factor analysis. This
is expected in virtual clusters. The cosine of the angular distance between
two vectors can be regarded as a correlation if the vectors represent
variables. It is in this sense that our n data points correspond
to n correlated variables of factor analysis and our variables themselves
correspond to the factors. With this understanding we can apply
the rotational techniques to the original data matrix in order to obtain
points lying on simple structure planes.

²Those readers who assume that this is analogous to factor analysis
should be reminded that the latter increases the dimensionality from p
correlated to p + k (at least partially) uncorrelated variables. There is, of
course, no relationship to an incomplete "component analysis" which produces
singular solutions.

In the present dissertation, we have synthesized these two
techniques (cluster analysis and simple structure identification) into a
single program. When the user subjects his data to this program, named
CLUSTR, he gets clusters and simple structure planes as the output. Next,
we have developed an interactive graphics program, named ELLIPSE, which
can be used to visualize the configuration of clusters and the underlying
simple structures. The configuration of multidimensional clusters can be
determined only if their dimensionality is reduced. For this purpose, it
is necessary to project the clusters onto several 2- or 3-dimensional
subspaces. The emphasis in the present work is on projecting the clusters
onto many 2-dimensional spaces and displaying them on the IBM 2250
Graphics terminal. If the plane of projection selected is orthogonal to a
simple structure plane, the points lying on the particular simple structure
plane will make a band of narrow width more or less resembling a straight
line. The display program can also be used in an exploratory manner. The
user can supply many different vectors. If, by using some vector of
projection, he sees narrow bands as described above, the corresponding vector
is orthogonal to a simple structure plane. Such visual determination of
simple structure planes, however, is rather difficult, especially if the
user is required to make inferences on configurations in four or more
dimensions on the basis of 2-dimensional displays. It is implicit from the
above discussion that the determination of simple structure is equivalent
to the fact that the points lying on a simple structure plane need at least
one variable less in their description. Thus, if p variables have been
measured on all the experimental units of the sample, (p-1) variables are
adequate in the case of those experimental units which lie on a simple
structure plane. In the case of these experimental units, one of the p
variables can then be expressed by a linear combination of the remaining
(p-1) variables. The statistical interpretation of this phenomenon is
considered in Chapter 3.
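
The idea that points near a simple structure plane need one variable less can be illustrated with a small Python sketch. It assumes a known hyperplane given by a normal vector and a center, and it uses a fixed tolerance on the orthogonal distance; the dissertation's own criterion is probabilistic, so the tolerance, the function names, and the data below are simplifying assumptions made here.

```python
from math import sqrt

def signed_orthogonal_distance(point, normal, center):
    """Signed distance of a point from the hyperplane normal'(x - center) = 0."""
    length = sqrt(sum(a * a for a in normal))
    return sum(a * (x - c) for a, x, c in zip(normal, point, center)) / length

def points_on_plane(points, normal, center, tol):
    """Indices of the points whose orthogonal distance is at most tol."""
    return [i for i, p in enumerate(points)
            if abs(signed_orthogonal_distance(p, normal, center)) <= tol]

# Hypothetical plane x1 + x2 - x3 = 0 in three dimensions: for points on it,
# the third variable is the linear combination x3 = x1 + x2 of the other two.
normal = (1.0, 1.0, -1.0)
center = (0.0, 0.0, 0.0)
points = [(1.0, 2.0, 3.0), (2.0, 0.5, 2.6), (1.0, 2.0, 5.0), (0.0, 4.0, 3.9)]
on_plane = points_on_plane(points, normal, center, tol=0.1)
```

The third point lies well off the plane, with a negative signed distance, and would be one of the points that requires the additional variable for an adequate description.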


CHAPTER  II 


PRINCIPAL  COMPONENT  AND  OTHER  PROJECTIONS 

2.1  Introduction 

As stated in the previous chapter, one of the goals of the present
work is to display many different projections of multidimensional clusters.
In this chapter, we give an account of projections along eigenvectors or
principal components of the metric ellipsoids onto 2-dimensional subspaces.
The principal component projections are necessarily "orthogonal." It is
worth noting at the outset that very little information was gained by these
principal component projections. Not that we were surprised by this
finding, but some social scientists seem to attribute a lot more to this
particular mathematical reference frame than it deserves.

2.2  Principal  Component  Projections 

Let Z (Z : n x p) be a data matrix of the multivariate observations.
The n data points can be represented in a p-dimensional space.
It is assumed that the clusters formed by these n points have been
identified  as  well  as  the  points  belonging  to  the  clusters.  It  is  further 
assumed  that  points,  which  cannot  be  assigned  to  one  of  these  clusters, 
have  been  eliminated.  Now,  if  the  data  come  from  a  p-variate  normal  distri¬ 
bution  (or  a  mixture  of  p-variate  normal  distributions  with  different  mean 
vectors  but  the  same  variance-covariance  matrix),  the  clusters  would  be 
elliptical  in  shape,  and  the  ellipsoids  would  have  the  same  orientation. 

If there are k clusters, and if we denote by $\mu_1, \mu_2, \ldots, \mu_k$
the cluster centers, the equations of the ellipsoids enclosing these
clusters can be written as

$$ (x - \mu_i)' \Sigma^{-1} (x - \mu_i) = \text{constant} $$

for i = 1, 2, ..., k, where x stands for the running coordinates. The
points belonging to a particular cluster will be enclosed within the
respective ellipsoids. In practice, since the population (or populations)
from which the sample was drawn will not be exactly normal, we will not
expect all the data points belonging to a particular cluster to lie within
the corresponding ellipsoid. However, unless the population is far from
normal, the ellipsoidal fit will be quite good. To visualize how the
points are distributed in p-dimensional space and how they look in relation
to their enclosing ellipsoids, we need to project the points onto several
2-dimensional spaces. The technique is simple and classic and is spelled
out here for the sake of completeness. Let us take the equation of the ith
unit ellipsoid,¹

$$ (x - \mu_i)' \Sigma^{-1} (x - \mu_i) = 1. \tag{2.2.1} $$

Since $\Sigma$ is a positive definite (or at least positive semi-definite)
matrix, there exists an orthogonal matrix Q such that

$$ Q' \Sigma Q = D_\lambda, \tag{2.2.2} $$

where $D_\lambda$ is a diagonal matrix with diagonal elements equal to the
characteristic roots of $\Sigma$, and Q is the matrix of the eigenvectors
of $\Sigma$. (2.2.2) can be written as

$$ \Sigma = Q D_\lambda Q' \tag{2.2.3} $$

and hence

$$ \Sigma^{-1} = Q D_{1/\lambda} Q', \tag{2.2.4} $$

where $D_{1/\lambda}$ is the inverse of $D_\lambda$. (The reciprocals of
the characteristic roots of $\Sigma$ are the diagonal elements of
$D_{1/\lambda}$.) Substituting (2.2.4) into (2.2.1), we have

$$ (x - \mu_i)' Q D_{1/\lambda} Q' (x - \mu_i) = 1 \tag{2.2.5} $$

or, if we let $y_i' = (x - \mu_i)' Q$, this reduces to

$$ y_i' D_{1/\lambda} y_i = 1. \tag{2.2.6} $$

¹ This generalizes the "standard unit interval" in univariate analysis.


(2.2.6) is the standard principal axes reduction of the conic (2.2.1). The
transformation $y_i' = (x - \mu_i)' Q$ is the orthogonal rotation of the
original reference axes in the direction of the principal axes of the
ellipsoids. The direction cosines of the principal axes of all the k
ellipsoids are identical because of the assumption of equal metric
(homogeneity of dispersion matrices). If we write the matrix Q as

$$ Q = [a_1, a_2, \ldots, a_p], $$

where $a_1, a_2, \ldots, a_p$ are the eigenvectors of $\Sigma$, the
transformation $y_i' = (x - \mu_i)' Q$ can be written as

$$ y_i' = (x - \mu_i)' [a_1, a_2, \ldots, a_p]. $$

Since Q is orthogonal, the elements of $y_i$ will be the coordinates of an
original data point $z_i$, which is the ith row of the data matrix Z, with
reference to the new coordinate axes. In particular, $(y_{i1}, y_{i2})$
represent the orthogonal projection of the original data point $z_i$ onto
the 2-dimensional plane determined by the eigenvectors $a_1$ and $a_2$.
Let us assume that the eigenvectors are arranged in descending order,
i.e., $a_1$ is the eigenvector corresponding to the largest root, $a_2$
that corresponding to the second largest root, etc. Then the 2-dimensional
plane determined by $a_1$ and $a_2$ is the plane containing the two largest
principal axes of the ellipsoids and we would be projecting the data points
onto this plane.

We can take all the $\binom{p}{2} = p(p-1)/2$ pairs of eigenvectors and
project the data points on these planes.
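
As an illustration only (this is not the program used for the displays),
the rotation $y' = (x - \mu)'Q$ and the enumeration of the p(p-1)/2
coordinate pairs can be sketched in Python. The function names, the
rotation matrix Q and the sample points are hypothetical; Q is assumed to
be supplied already, with the eigenvectors as its columns.

```python
import math
from itertools import combinations

def rotate_to_principal_axes(x, center, Q):
    """Return y' = (x - center)' Q for one point; columns of Q are eigenvectors."""
    d = [xi - ci for xi, ci in zip(x, center)]
    p = len(d)
    return [sum(d[r] * Q[r][c] for r in range(p)) for c in range(p)]

def pairwise_projections(points, center, Q):
    """Map each index pair (a, b) to the 2-D coordinates of all rotated points."""
    rotated = [rotate_to_principal_axes(x, center, Q) for x in points]
    p = len(center)
    return {(a, b): [(y[a], y[b]) for y in rotated]
            for a, b in combinations(range(p), 2)}

# Hypothetical 3-variable example: Q is a rotation about the third axis.
c, s = math.cos(math.pi / 6), math.sin(math.pi / 6)
Q = [[c, -s, 0.0],
     [s,  c, 0.0],
     [0.0, 0.0, 1.0]]
center = [1.0, 2.0, 3.0]
points = [[2.0, 2.0, 3.0], [1.0, 3.0, 5.0]]
planes = pairwise_projections(points, center, Q)  # keys (0, 1), (0, 2), (1, 2)
```

For p = 3 the dictionary holds three planes, one per pair of eigenvectors,
each ready to be plotted as a separate 2-dimensional display.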

2.3  Principal  Component  Displays 

The IBM 2250 Graphics terminal was used to display projections on
the p(p-1)/2 2-dimensional planes formed by each pair of the eigenvectors.
The data used by Graney [10] were taken, and the three clusters identified
by him were projected on all the three 2-dimensional pairs formed by the 3
variables. The unit ellipses (i.e., projections of the unit ellipsoid)
around the clusters were also displayed. The orientation of these
ellipses is, in the case of principal components, of course parallel to
the axes of reference. If, however, one of the axes is not a principal
component, the inclination of the principal axes of the ellipses with the
reference axes can be displayed. This inclination may present some
evidence regarding the nature of the data points, which was obscured in the
principal component plot. Because of the disappointing lack of information
contained in the principal component plots, we did not even bother to
document the computer program. Instead, the program which has been
documented permits projection around arbitrary pairs of reference axes.
These computer programs are described in detail in Chapter V.

2.4  Other  Projections

Let the equation of the p-dimensional unit ellipsoid be

$$ (x - \mu_i)' \Sigma^{-1} (x - \mu_i) = 1 $$

where $\Sigma$ is the p x p variance-covariance matrix, x are the running
coordinates and $\mu_i$ is the cluster center of the ith cluster. There
will be as many ellipsoids as the number of clusters identified. For the
sake of notational convenience, we will drop the subscript i from $\mu_i$.
As all the ellipsoids are referred to the same metric $\Sigma^{-1}$, they
differ only with respect to their location. In the discussion that follows,
we are concerned with the shape rather than the location. The matrix
$\Sigma$, as a population parameter, is unknown and we will replace it by
its unbiased estimate, S, the matrix of mean squares and mean products
within groups. The projection on the 2-dimensional plane formed by the
variables i and j can be written as

$$ s^{ii}(x_i - \mu_i)^2 + s^{jj}(x_j - \mu_j)^2 + 2 s^{ij}(x_i - \mu_i)(x_j - \mu_j) = 1 $$

where $s^{ij}$ is the (i, j)th element of $S^{-1}$. These equations for
various values of i and j are employed in displaying the projections. For
the purposes of computer programming, this equation is further simplified
to

$$ s^{ii} y_i^2 + s^{jj} y_j^2 + 2 s^{ij} y_i y_j = 1 $$

where $x_i - \mu_i = y_i$ and $x_j - \mu_j = y_j$. By employing the standard
reduction techniques, the coordinates to plot the ellipses can be easily
calculated.
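
The "standard reduction techniques" are not spelled out in the text. One
possible way of obtaining plot coordinates, given here only as a sketch
and not as the routine of Chapter V, is to walk around the center in polar
form: along the direction (cos t, sin t) the ellipse
$a y_i^2 + b y_j^2 + 2 c y_i y_j = 1$ lies at radius
$1/\sqrt{a\cos^2 t + b\sin^2 t + 2c\sin t\cos t}$. The names a, b, c (for
$s^{ii}$, $s^{jj}$, $s^{ij}$) and the numbers used are hypothetical.

```python
import math

def ellipse_points(a, b, c, center=(0.0, 0.0), n=72):
    """Points on a*y1**2 + b*y2**2 + 2*c*y1*y2 = 1, shifted to `center`."""
    if a <= 0 or a * b - c * c <= 0:
        raise ValueError("quadratic form must be positive definite")
    pts = []
    for k in range(n):
        t = 2.0 * math.pi * k / n
        u, v = math.cos(t), math.sin(t)
        # Radius at which the ray in direction (u, v) meets the ellipse.
        r = 1.0 / math.sqrt(a * u * u + b * v * v + 2.0 * c * u * v)
        pts.append((center[0] + r * u, center[1] + r * v))
    return pts

# Hypothetical coefficients and cluster center.
pts = ellipse_points(2.0, 1.0, 0.5, center=(3.0, -1.0))
```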
Further aspects of this phase of the computer program are treated in
Chapter V.


CHAPTER III

CLUSTERS IN SUBSPACES: THE SIMPLE STRUCTURE SUBSPACE

3.1  The  Problem 

The previous two chapters considered the problem of cluster
identification and cluster configuration. We now come to the second
part of our inquiry: identification of experimental units or points
which could be described by measuring fewer variables on them. Before we
proceed further with the identification of the points lying on a subspace,
we want to consider the situations where such a problem can arise. A
familiar example would be one of medical diagnosis. A number of patients
are measured on a number of medical symptoms. Sometimes these measurements
are repeated on the same patients for a number of days consecutively. Here
the symptoms measured are our variables and the patients are subjects or
experimental units. If the measurements are taken on different days, it
would add a factor of classification; let us assume that this factor
(i.e., days) has not been recorded. Of course, from these data one can
construct a variance-covariance matrix and from that obtain a correlation
matrix. This could be subjected to factor analysis. Factor analysis would
reveal groupings of the symptoms in these data. Application of Thurstone's
[36] principle of simple structure could reveal such groupings.
Disappointing examples of this kind of analysis, resulting in groupings of
symptoms like "chill," "fever," and "pain," are not so uncommon.

The data could also be subjected to cluster analysis to form
groups of patients. In this type of analysis, the patients would be
classified into groups depending upon the absence or severity of symptoms
they have in common with respect to other patients belonging to the same
group. Thus, for the patients belonging to the same group, almost all
the patients would be measuring equally on the average on different
symptoms; they may either measure high on the same symptom compared to
other patients belonging to different groups or measure low, etc. To
summarize, factor analysis would tell us about the symptoms, and cluster
analysis would help us to group patients according to the "degree of
severity," say, of the symptoms. According to cluster analysis, we may
have two patients belonging to different groups perhaps because one
measured low on one symptom and the other measured high on the same
symptom. It may happen that the particular symptom would have left the
final diagnosis unchanged. To this extent, this symptom could be discarded
so far as these two individuals are concerned and then they will
belong to the same "group." But neither factor analysis nor cluster
analysis would bring out this fact; nor would either bring out the fact
that "days" is a hidden factor. We must extend both techniques before we
can identify such configurations. The problem is formulated in mathematical
terms in the next section where we present, in detail, a specific
approach.

3.2  Mathematical  Formulation

In previous chapters we have dealt with the techniques of cluster
analysis. It was pointed out that given an n x p data matrix, it is
possible to identify the underlying clusters. We now ask the next
question: are there any data points which, instead of lying in the
p-dimensional space, lie on a (p-1)-dimensional hyperplane? This
question is important as an answer to it, among other things, will reveal
the following things: (i) The points that lie on the (p-1)-dimensional
hyperplane. The determination of the points lying on the (p-1)-dimensional
hyperplane will help us to describe these points in terms of (p-1)
variables instead of p variables. (ii) We will essentially devise a
"discriminant function" which helps us to split the original sample into
two groups of points: one group which needs all the p original variables
to describe the points belonging to it and another group which needs (p-1)
instead of p variables to describe the points belonging to it. In the
second case, we have also to consider the question of which variable to
discard from the original p variables. Note that it is also implied in
the second case that the p x p variance-covariance matrix of the points
lying on the (p-1)-dimensional hyperplane will be singular, as its rank
will be (p-1) and not p.

3.3  A  Basic  Probability

At the outset, let us make clear what we mean by points lying
"approximately" in a subspace. Our attempt will be to look for those
points which lie close to a (p-1)-dimensional hyperplane in a p-dimensional
space. Clearly, any (p-1) vectors always lie exactly on a (p-1)-dimensional
hyperplane, whereas a minimum of p vectors is necessary to overdetermine
a (p-1)-dimensional hyperplane. We will, therefore, look for those
(p-1)-dimensional hyperplanes which have p or more vectors lying close to
them.

The singular case where p or more vectors lie exactly on a plane will be
ignored, since in this instance those points lying on the hyperplane will
satisfy a linear relationship between the p variables; this cannot happen
unless there is a deterministic linear relationship between the p variables
in the population, and thus any sample will reflect this relationship and
the p x p estimated variance-covariance matrix of any sample from such a
population will be singular. Since we require inverses of dispersion
matrices, we must exclude these redundancies. To determine whether a
point overdetermines a hyperplane, we shall consider its Euclidean distance
from the hyperplane and examine whether this distance could be considered
negligible in a probabilistic sense. We need an expression for this
probability. The derivation follows the reasoning given by Bargmann [2].
Let P be a point in 2-dimensional space. We are interested in the
orthogonal distance of this point from a line, which is a hyperplane in the
2-dimensional case. Without loss of generality, we can assume the X-axis to
be this line (Figure 3.3.1). Let the vector OP subtend an angle $\theta$ at
the origin with the X-axis. We must now assume that the reference axes span
a Cartesian frame (orthogonal, equal units along each axis). This implies
that a Gram-Schmidt transformation of the original observations (using S,
the within estimated dispersion matrix, as the original metric) must precede
the calculation of probabilities of overdetermination. With this assumption
we can now draw a circle with OP as radius. Let M be the foot of the
perpendicular drawn from P on the X-axis. Then

$$ PM = OP \cdot \sin\theta. $$

Therefore,

$$ \theta = \sin^{-1}(PM/OP) = \sin^{-1}(2PM/2OP) = \sin^{-1}(a/2h), $$

where a = PP' and h is the radius of the circle. Thus, the probability that
a point falls within the angle $\theta$ on the circumference of the circle
is given by

$$ P_2 = \frac{2 \times \text{length of arc generated by angle } 2\theta}{\text{circumference of the circle}}
       = \frac{2 \cdot OP \cdot 2\theta}{2\pi \cdot OP}
       = \frac{2\theta}{\pi}
       = \frac{2}{\pi}\arcsin\!\left(\frac{a}{2h}\right). \tag{3.3.1} $$

We can view the above probability either way:

(i) the probability of a point falling within an angle $\theta$ as stated
above;
(ii) the probability of a point falling within the orthogonal distance of
+a/2 to -a/2.
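
A small numerical check of (3.3.1) may help fix ideas. The sketch below,
which is only an illustration with hypothetical function names, evaluates
the closed form and compares it with the fraction of equally spaced
directions on the circle whose distance from the X-axis does not exceed
a/2.

```python
import math

def p2(a, h):
    """(3.3.1): chance that a point on a circle of radius h lies within a/2 of the X-axis."""
    return (2.0 / math.pi) * math.asin(a / (2.0 * h))

def p2_by_counting(a, h, m=100000):
    """Fraction of m equally spaced points on the circle with |y| <= a/2."""
    hits = sum(
        1 for k in range(m)
        if abs(h * math.sin(2.0 * math.pi * (k + 0.5) / m)) <= a / 2.0
    )
    return hits / m

closed_form = p2(1.0, 1.0)            # band of total width a = 1 on the unit circle
counted = p2_by_counting(1.0, 1.0)    # agrees with the closed form to about 1e-4
```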


The technique involved in generalizing the above probability to higher
dimensions, say k, is simple. We have to obtain the surface element of a
sphere of radius h in k dimensions and the surface element cut off by the
k-dimensional arc which subtends an angle $\theta$ at the center of the
sphere. We shall illustrate the procedure for 3 dimensions before
generalizing it to k dimensions. The 3-dimensional sphere can be obtained
by revolving a semicircle around its diameter. The surface element
generated by the arc between y = -a/2 and y = a/2 is twice the element
generated by the arc between y = 0 and y = a/2, and hence the 3-dimensional
surface element between y = -a/2 and y = +a/2 can be expressed as

$$ C_3 = 2 \cdot 2\pi \int_0^{a/2} x \, ds $$

where $x = \sqrt{h^2 - y^2}$ and $ds = \dfrac{h}{\sqrt{h^2 - y^2}}\, dy$.
Hence

$$ C_3 = 2 \cdot 2\pi \int_0^{a/2} \sqrt{h^2 - y^2}\, \frac{h}{\sqrt{h^2 - y^2}}\, dy = 2\pi a h $$

and $S_3$, the total surface element, will be

$$ S_3 = 2 \cdot 2\pi \int_0^{h} \sqrt{h^2 - y^2}\, \frac{h}{\sqrt{h^2 - y^2}}\, dy = 4\pi h^2. $$

Hence,

$$ P_3 = C_3 / S_3 = a/2h. \tag{3.3.2} $$

For k dimensions, the (k-1)-dimensional semisphere is required to be
revolved around a (k-2)-dimensional hyperplane. In the above notation,
the probability can be expressed as

$$ P_k = \frac{\int_0^{a/2} (h^2 - y^2)^{(k-3)/2}\, dy}{\int_0^{h} (h^2 - y^2)^{(k-3)/2}\, dy}. $$

In the numerator of the above expression, substitute $y = h\sin\phi$. Then
$dy = h\cos\phi\, d\phi$ and the integral reduces to

$$ h^{k-2} \int_0^{\alpha/2} \cos^{k-2}\phi \, d\phi $$

where $\alpha/2 = \arcsin(a/2h)$. By the same substitution, the denominator
could be reduced to

$$ h^{k-2} \int_0^{\pi/2} \cos^{k-2}\phi \, d\phi. $$

If we let $\sin^2\phi = z$, this integral can be reduced to the complete
Beta integral

$$ \tfrac{1}{2} B\!\left(\tfrac{1}{2}, \tfrac{k-1}{2}\right) = \tfrac{1}{2} \int_0^1 z^{-1/2} (1 - z)^{(k-3)/2}\, dz $$

and the numerator could be reduced to

$$ \tfrac{1}{2} \int_0^{\sin^2(\alpha/2)} z^{-1/2} (1 - z)^{(k-3)/2}\, dz. $$

Therefore, $P_k$, the ratio, can be expressed as

$$ P_k = B\!\left(\sin^2(\alpha/2);\ \tfrac{1}{2},\ \tfrac{k-1}{2}\right) \tag{3.3.3} $$

where the right side of (3.3.3) is the incomplete Beta function (the ratio
of the incomplete to the complete Beta integral). This is the probability
of a point falling within (-a/2, +a/2) in k dimensions. For k = 2 and
k = 3 it reduces to (3.3.1) and (3.3.2), respectively.

We will now make use of this probability expression in studying the simple
structure configuration.
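
The ratio of integrals behind (3.3.3) can be evaluated numerically without
any special-function library. The following sketch is offered as an
illustration only: it applies Simpson's rule (our choice here, not a method
named in the text) to $\cos^{k-2}\phi$, and the function names are
hypothetical.

```python
import math

def _simpson(f, lo, hi, n=2000):
    """Composite Simpson's rule on [lo, hi] with an even number n of intervals."""
    step = (hi - lo) / n
    total = f(lo) + f(hi)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(lo + i * step)
    return total * step / 3.0

def band_probability(k, a, h=1.0):
    """P_k: chance that a point on a k-dimensional sphere lies within a/2 of a hyperplane."""
    if k < 2 or not 0.0 <= a <= 2.0 * h:
        raise ValueError("need k >= 2 and 0 <= a <= 2h")
    half_angle = math.asin(a / (2.0 * h))      # alpha/2 in the text
    f = lambda phi: math.cos(phi) ** (k - 2)
    return _simpson(f, 0.0, half_angle) / _simpson(f, 0.0, math.pi / 2.0)

p_circle = band_probability(2, 1.0)   # should agree with (3.3.1)
p_sphere = band_probability(3, 1.0)   # should agree with (3.3.2), i.e. a/2h
```

For a fixed band width the probability grows with k, since in higher
dimensions more of the surface lies near any given hyperplane through the
center.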

3.4  Determination  of  Simple  Structure  Configuration

Let us assume that we have a total of n points in p-dimensional
space. As stated in the previous section, the probability, in p-dimensional
space, of a point falling within (-a/2, a/2) is given by

$$ P_p = B\!\left(\sin^2(\alpha/2);\ \tfrac{1}{2},\ \tfrac{p-1}{2}\right) \tag{3.4.1} $$

where $\alpha/2 = \arcsin(a/2h)$. Without loss of generality, h could be
taken to be 1. We are interested in determining whether there are points
lying on a subspace. In a p-dimensional space, (p-1) points always lie on a
(p-1)-dimensional subspace. Thus out of a total of n points, we have
available only (n-p+1) free points. Likewise, if we are to make any
statement about r points lying on a subspace, we can only look to (r-p+1)
points, as (p-1) of them will always lie on a subspace. If we denote by
$P_p$ the probability of a single point lying within a fixed distance
±a/2 from a given (p-1)-dimensional hyperplane, in p-space, the probability
that (r-p+1) points will fall in that region out of (n-p+1) is given by

$$ \binom{n-p+1}{r-p+1} P_p^{\,r-p+1} (1 - P_p)^{n-r}. $$

Thus, if a region is to contain r or more points, the probability would
be given by the cumulative binomial distribution

$$ \sum_{j=r-p+1}^{n-p+1} \binom{n-p+1}{j} P_p^{\,j} (1 - P_p)^{n-p+1-j}. \tag{3.4.2} $$

Also, on the basis of the (n-p+1) free points available, the expected
number of points (in excess of (p-1)) falling in a (p-1)-dimensional region
is

$$ (n-p+1) P_p. \tag{3.4.3} $$

The sum of binomial terms (3.4.2) can be evaluated by the incomplete Beta
function. The result can be summarized as

$$ P[\text{r or more of n points lie within } \pm a/2] = B(P_p;\ r-p+1,\ n-r+1) \tag{3.4.4} $$
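
As a sketch only, the tail probability (3.4.2) can be summed directly,
which gives the same value as the incomplete Beta form (3.4.4). The
function names are hypothetical, and the value P = 0.5 below is chosen
purely to make the arithmetic easy to follow.

```python
from math import comb

def prob_r_or_more(n, p, r, P):
    """Chance that r or more of n points lie in the band, counting n-p+1 free points."""
    free = n - p + 1          # points not used up in fixing the hyperplane
    need = r - p + 1          # free points that must fall in the band
    if need <= 0:
        return 1.0            # p-1 points always lie on the hyperplane
    return sum(comb(free, j) * P ** j * (1.0 - P) ** (free - j)
               for j in range(need, free + 1))

def expected_excess(n, p, P):
    """(3.4.3): expected number of free points falling in the band."""
    return (n - p + 1) * P

# With n = 3, p = 2 there are two free points; at least one of them falls
# in the band with probability 1 - (1 - 0.5)**2.
tail = prob_r_or_more(n=3, p=2, r=2, P=0.5)
```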


where $P_p$ itself is an incomplete Beta function. If this probability is
very small we shall say that the simple structure subspace determined by
the points lying on it is overdetermined. However, the probability of a
subspace being overdetermined itself depends on the probability $P_p$. We,
therefore, need to consider the interval width and the probabilities, so
that the "probability of overdetermination" may be a meaningful concept.
Bargmann [2], in studying the overdetermined subspaces in relation to
factor analysis, fixed the ratio a/h to be ±0.10. These criteria, which
reflect common usage in factor analysis, did not suit our requirements.

The object of this test, as will be discussed in Chapter IV, is
to suggest to the viewer of a graphics display some vectors which are
normal to overdetermined hyperplanes. If no such concentration were present,
the expected number of points would be (according to (3.4.3) and the
discussion on free points)

$$ (n-p+1) P_p + (p-1). \tag{3.4.5} $$

After considerable experimentation, with n between 20 and 50, and p
between 2 and 5, we found that taking $P_p$ such that the expected number
of points is (p+5), i.e.,

$$ P_p = 6/(n-p+1), \tag{3.4.6} $$

gave satisfactory results in the identification of overdetermined
hyperplanes by the Beta test (3.4.4). A value of $P_p$ smaller than this
was too stringent so that a considerable number of points would have to lie
on a subspace before it could be considered well determined. The suggested
[text missing in the source] tightening up the probability level for
declaring a subspace to be overdetermined. We set this probability at 0.01.
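
To see what (3.4.6) implies for the width of the band, one can solve
(3.4.1) for $\sin(\alpha/2)$ once $P_p$ is fixed. The sketch below does so
by bisection on a Simpson's-rule evaluation of the integral ratio; both
numerical devices and all function names are our own assumptions for
illustration and are not taken from the program described in Chapter V.

```python
import math

def _cos_power_integral(k, upper, steps=400):
    """Simpson's rule for the integral of cos(phi)**(k-2) from 0 to upper."""
    step = upper / steps
    f = lambda phi: math.cos(phi) ** (k - 2)
    total = f(0.0) + f(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(i * step)
    return total * step / 3.0

def single_point_probability(n, p):
    """(3.4.6): P_p chosen so that p + 5 points are expected near a hyperplane."""
    return 6.0 / (n - p + 1)

def sin_half_alpha(n, p):
    """Solve (3.4.1) for sin(alpha/2), with h = 1, by bisection."""
    target = single_point_probability(n, p)
    if not 0.0 < target < 1.0:
        raise ValueError("need n - p + 1 > 6")
    full = _cos_power_integral(p, math.pi / 2.0)
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if _cos_power_integral(p, math.asin(mid)) / full < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

bound = sin_half_alpha(n=32, p=3)   # in 3 dimensions P_3 = sin(alpha/2), so about 0.2
```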

The arbitrariness of choice of $P_p$ and levels of significance may
be disquieting to some readers. They may be reminded, though, that we are
dealing with a phase of data analysis which is exploratory. A sample from
a p-variate normal distribution, with a dispersion matrix of rank (p-1),
will always be on a (p-1)-dimensional subspace, exactly. There is no
test for a hypothesis in the population. Rather, we deal with an instance
where, after some linear transformations of the original variables, one of
them has negligible variance, after some points have been deleted from the
sample. Consequently, a decision as to what is negligible, and what is
"close to a plane," is really as arbitrary as declaring that, viewed from
some point in the universe, most (but not all) of the stars of a galaxy lie
close to a plane. In the final analysis, only the graphic display of
certain projections will reveal such intuitive configurations.

In our examples, the subspaces were overdetermined at much smaller
values than 0.01, which points to the fact that (3.4.6) was quite useful.
For the rest, the choice of critical values is as arbitrary as the "0.05
level" (because we have five fingers?) and the 0.01 level of significance.
As a guide for displaying configurations, our two levels of $P_p$ and
$\alpha$ were "useful."


CHAPTER IV

CLUSTERS IN SUBSPACES: IDENTIFICATION

4.1  Introduction 

In the previous chapter, we addressed ourselves to the statistical
aspects of well determined subspaces and derived a few pertinent geometric
probability expressions which may guide us to find such spaces. No mention
was made of techniques for finding such configurations. In the
present chapter, we consider (a) the technique of determining points lying
on a subspace; and (b) the problem of which variable could be discarded for
the points which describe an overdetermined subspace.

4.2  Overdetermined  Subspaces

The problem of determining subspaces in our case is rather similar
to the determination of simple structures in factor analysis. The
essential difference, however, lies in the fact that a factor analyst looks
for simple structure among variables. From data points, he constructs a
correlation matrix and gets a factor matrix by applying one of the many
suitable techniques available for this purpose. If he so desires, after
obtaining an initial solution of the factor matrix, he may obtain a
"preferred" representation, e.g., Lawley's form [14], etc. For a factor
analyst such a solution may not serve his purpose if he is interested in
relating the artificial variables to observable ones. In that case he
will have to resort to some other forms of representation, such as the
simple structure technique. The number of artificial variables required
to explain an observable variable is known as the complexity of the
observable variable. For the purpose of interpretation of artificial
variables, it is desirable that complexities of observable variables be
low. Both analytic and geometrical techniques are available to a factor
analyst to express the final solution in a form suitable for
interpretation.

We have a slightly different problem. First of all, we do not
look for any artificial variables to represent the observable variables.
Thus whereas a factor analyst works on a factor matrix, we work on the
data matrix itself. The starting point for a factor analyst is the
correlation matrix obtained from the data matrix. A somewhat similar
standardization is employed in our case, except that we standardize the
data matrix on the basis of cluster means and the "within" matrix and
then normalize the points to unit length. After displaying these points,
and the unit ellipses around the cluster means, on all 2-dimensional
directions, we proceed to single out those points which could be described
in terms of fewer variables. In this connection, it does not concern us
how far apart these points are, as long as they lie on a subspace of
lower dimensionality. Thus this technique has the advantage that it can
identify points lying on a subspace even though they may belong to
different populations, or clusters. This is precisely what we had in
mind. The technique of cluster analysis assigns points to the populations
to which they belong. Leaving this structure intact, our new technique
determines points which lie on a subspace.

A discussion may be in order regarding the number of subspaces
one can find. If we can determine one overdetermined plane, the chances
are that there are many planes in the vicinity of a plane already found.
The reason is that, by "tilting" the plane already found, we can still
retain many of the points belonging to the original subspace found, pick
up a few new points and obtain another overdetermined plane. However,
we can reduce the multiplicity of subspaces by ignoring those additional
planes found which lie within a certain range of the original plane. This
can be done by requiring a minimum angle between the normals to two
distinct planes. What angle should be maintained between two planes before
declaring them as distinct is a matter of choice. The "orthogonality"
preferred by some factor analysts is, at least for our problem, quite
useless.
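
The minimum-angle requirement can be sketched as follows. The text leaves
the angle as a matter of choice, so the default of 20 degrees below is a
purely hypothetical value, as are the function names; the sign of a normal
is ignored because t and -t describe the same plane.

```python
import math

def angle_between_planes(t1, t2):
    """Angle in degrees between two hyperplanes given their normal vectors."""
    dot = sum(a * b for a, b in zip(t1, t2))
    n1 = math.sqrt(sum(a * a for a in t1))
    n2 = math.sqrt(sum(b * b for b in t2))
    cosine = min(1.0, abs(dot) / (n1 * n2))   # guard against rounding above 1
    return math.degrees(math.acos(cosine))

def is_distinct(t_new, found, min_angle_deg=20.0):
    """True if t_new is at least min_angle_deg away from every normal in `found`."""
    return all(angle_between_planes(t_new, t) >= min_angle_deg for t in found)

found = [[1.0, 0.0, 0.0]]
tilted = is_distinct([1.0, 0.1, 0.0], found)     # a slight tilt of a known plane
separate = is_distinct([0.0, 1.0, 0.0], found)   # a clearly different plane
```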

4.3  Identification  of  Subspaces 

In this section, the general identification technique will be
described. Let there be n experimental units, each with p measurements.
The data matrix of order n x p will be designated as X. This
matrix is subjected to a cluster analysis program. If the data are
normally distributed, we need to use virtual clusters; hence we used the
program developed by Bargmann and Graney [5] for this purpose. After NG
(computer program notation) such clusters have been found, we may define
a matrix A, with elements $a_{ij}$, of order n x NG such that

$$ a_{ij} = 1 \text{ if unit i belongs to cluster j}, \qquad a_{ij} = 0 \text{ otherwise}. \tag{4.3.1} $$

Let $D_{1/k}$ (of order NG x NG) denote a diagonal matrix with elements
$1/k_j$, j = 1, 2, ..., NG, where $k_j$ is the number of points assigned to
cluster j. Subtraction of cluster centers or cluster means from each point
produces a matrix Y, formally given by

$$ Y = X - A D_{1/k} A' X \tag{4.3.2} $$

$$ \phantom{Y} = (I - A D_{1/k} A') X. \tag{4.3.3} $$

Now we can obtain an estimate of the common dispersion matrix $\Sigma$,
using the within-cluster sample dispersion matrix

$$ S = (1/n_e)\, Y'Y \tag{4.3.4} $$

where $n_e = n - NG$. For the determination of subspaces, we must first
transform our reference frame to a Cartesian metric (which is, of course,
merely a computational device, and never actually displayed). As a
convenient technique, we used the Gram-Schmidt ("Forward Doolittle")
reduction,

$$ S = TT' \tag{4.3.5} $$

where T is a lower triangular matrix with positive diagonal elements, hence
unique. In terms of this Cartesian reference frame, the data matrix is now
transformed into

$$ Z = Y (T')^{-1}. \tag{4.3.6} $$
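
Steps (4.3.2) to (4.3.6) can be sketched in a few lines of Python. This is
an illustration under our own assumptions, not the documented program: the
function names and the six sample points are hypothetical, cluster
membership is passed as a list of labels rather than as the matrix A, and
the triangular factor is obtained by the usual Cholesky recursion.

```python
import math

def center_within_clusters(X, labels):
    """(4.3.2): subtract from each row the mean of its own cluster."""
    p = len(X[0])
    sums, counts = {}, {}
    for row, g in zip(X, labels):
        acc = sums.setdefault(g, [0.0] * p)
        for j, value in enumerate(row):
            acc[j] += value
        counts[g] = counts.get(g, 0) + 1
    Y = [[value - sums[g][j] / counts[g] for j, value in enumerate(row)]
         for row, g in zip(X, labels)]
    return Y, len(counts)

def within_dispersion(Y, n_groups):
    """(4.3.4): S = Y'Y / (n - NG)."""
    n, p = len(Y), len(Y[0])
    return [[sum(r[a] * r[b] for r in Y) / (n - n_groups) for b in range(p)]
            for a in range(p)]

def cholesky_lower(S):
    """(4.3.5): lower triangular T with positive diagonal such that S = TT'."""
    p = len(S)
    T = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1):
            acc = S[i][j] - sum(T[i][k] * T[j][k] for k in range(j))
            if i == j:
                if acc <= 0.0:
                    raise ValueError("S is not positive definite")
                T[i][j] = math.sqrt(acc)
            else:
                T[i][j] = acc / T[j][j]
    return T

def to_cartesian(Y, T):
    """(4.3.6): Z = Y (T')^{-1}, i.e. solve T z = y for every row y."""
    p = len(T)
    Z = []
    for y in Y:
        z = [0.0] * p
        for i in range(p):
            z[i] = (y[i] - sum(T[i][k] * z[k] for k in range(i))) / T[i][i]
        Z.append(z)
    return Z

# Hypothetical data: six units, two variables, two clusters.
X = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [10.0, 10.0], [11.0, 13.0], [13.0, 11.0]]
labels = [0, 0, 0, 1, 1, 1]
Y, n_groups = center_within_clusters(X, labels)
T = cholesky_lower(within_dispersion(Y, n_groups))
Z = to_cartesian(Y, T)
```

In the transformed frame the within-cluster dispersion matrix of Z is the
identity, which is what is meant by a Cartesian metric.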

To find the "simple structure" subspaces, we look for unit vectors t such
that

$$ Z t = v \tag{4.3.7} $$

and v has the property that as many elements as possible are close to
zero in the following sense:

Let $z_i'$ denote the ith row of Z and $v_i$ denote the ith element of
v. Then it is clear that

$$ z_i' t = v_i. \tag{4.3.8} $$

The ith element in v, namely $v_i$, is considered close to zero if

$$ |v_i| / \sqrt{z_i' z_i} < \sin(\alpha/2) \tag{4.3.9} $$

where $\sin(\alpha/2)$ is determined by (3.4.1).

Let us now consider the significance of the expressions (4.3.8)
and (4.3.9). t is a unit vector and so is $z_i / \sqrt{z_i' z_i}$. Their
"inner product," $z_i' t / \sqrt{z_i' z_i}$, therefore is the cosine of the
angle between the vectors t and $z_i$. But

$$ z_i' t / \sqrt{z_i' z_i} = v_i / \sqrt{z_i' z_i} \tag{4.3.10} $$

and hence $v_i / \sqrt{z_i' z_i}$ is the cosine of the angle between the
vectors t and $z_i$. Since t is unit normal to the subspace, this cosine
equals the sine of the angle between $z_i$ and the subspace, and (4.3.9) is
established.

For a given number of experimental units, n, and p measurements
on each of them, we can determine $P_p$ using (3.4.6) and hence
$\sin(\alpha/2)$ using (3.4.1). (4.3.9) is then a test to determine if the
vector corresponding to a given data point (reduced in terms of z's) is
close to the subspace to which t is orthogonal. If the vector (and hence
the data point) is close to the subspace, we regard the point as falling in
the overdetermined region, and treat the corresponding element of v as
"zero." We can examine each element of v in this manner and determine how
many "zero" elements there are. It is the count of these "zero elements"
which we subject to the Beta test (3.4.4). If this test is significant, we
say that the subspace is overdetermined and report it as a solution,
provided it is not "close" to any subspace already found.
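
Counting the "zero" elements of v according to (4.3.9) can be sketched as
below. The function name and the sample rows are hypothetical, and
skipping rows of zero length is our own assumption, since the text does
not say how such rows are to be treated.

```python
import math

def count_near_zero(Z, t, sin_half_alpha):
    """Count rows z_i of Z with |z_i't| / sqrt(z_i'z_i) < sin(alpha/2)."""
    count = 0
    for z in Z:
        norm = math.sqrt(sum(c * c for c in z))
        if norm == 0.0:
            continue                  # a point at the cluster mean has no direction
        v = sum(c * tc for c, tc in zip(z, t))
        if abs(v) / norm < sin_half_alpha:
            count += 1
    return count

# Hypothetical reduced data; t is the unit normal to the plane of the first two axes.
Z = [[1.0, 0.0, 0.01], [0.0, 2.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
zeros = count_near_zero(Z, [0.0, 0.0, 1.0], sin_half_alpha=0.1)
```

The resulting count plays the role of r in the Beta test (3.4.4).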

The  transformation  vectors  are  found  in  a  manner  analogous  to 
Thurstone's  "Analytical  Method"  combined  with  his  earlier  "Single  Plane 
Method"  [36].  According  to  this  method,  each  row  of  the  reduced  data  matrix 
Z is used as a point of departure to find a vector $\mathbf{t}$. Let us
start with the ith row vector $\mathbf{z}_i'$. We shall assume that
$z_{i1}, z_{i2}, \ldots, z_{ip}$ are elements of $\mathbf{z}_i$. Then
$\mathbf{t}_0$, the initial approximation to $\mathbf{t}$, is obtained by
normalizing $\mathbf{z}_i$, i.e.,

$$\mathbf{t}_0 = \mathbf{z}_i/\sqrt{\mathbf{z}_i'\mathbf{z}_i} = (t_{01}, t_{02}, \ldots, t_{0p})'.$$

With this trial, the projections

$$\mathbf{v}_0 = Z\mathbf{t}_0$$

are calculated and the elements are tested for closeness to zero.

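If a value is very close, a large weight is assigned to the point, for the
subsequent weighted regression technique. If it is large, the weight may
be zero. Following Thurstone, we use discrete (step) weights, as follows:

If $0 < |v_{0i}|/\sqrt{\mathbf{z}_i'\mathbf{z}_i} < \sin(\alpha/2)$,

    $w_i = \mathrm{BND}$ (BND, bound, initially 8).

If $\sin(\alpha/2) < |v_{0i}|/\sqrt{\mathbf{z}_i'\mathbf{z}_i} < 2\sin(\alpha/2)$,

    $w_i = \mathrm{BND} - 1$,

and so on, the weight dropping by 1 for each further multiple of
$\sin(\alpha/2)$.

If $|v_{0i}|/\sqrt{\mathbf{z}_i'\mathbf{z}_i} > \mathrm{BND} \times \sin(\alpha/2)$,

    $w_i = 0$.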
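
The step weights can be written as a one-line rule. The Python sketch
below is only an illustration: the function name `step_weight` is
hypothetical, the intermediate steps (weights BND-2, BND-3, ...) are
assumed to follow the pattern of the first two cases, and a ratio that
falls exactly on a boundary is assigned to the lower weight, a point on
which the text is silent.

```python
def step_weight(ratio, sin_half_alpha, bnd=8):
    """Discrete weight for a point whose |v_i| / sqrt(z_i' z_i) is `ratio`.

    The weight is BND inside the first band [0, sin(alpha/2)), drops by
    one for each further band of the same width, and is never negative.
    """
    bands_passed = int(ratio // sin_half_alpha)  # whole multiples of sin(alpha/2)
    return max(bnd - bands_passed, 0)

# With the initial bound of 8 and a hypothetical sin(alpha/2) = 0.1:
w_close = step_weight(0.05, 0.1)         # first band
w_second = step_weight(0.15, 0.1)        # second band
w_far = step_weight(0.95, 0.1)           # beyond BND * sin(alpha/2)
w_late = step_weight(0.15, 0.1, bnd=1)   # with BND reduced to 1
```

With BND equal to 1 the rule reduces to a 0/1 indicator of whether the
point passes the closeness test.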

With these weights we can follow a weighted regression scheme to obtain an
improved vector $\mathbf{t}_1$. This can be further simplified by a
relation (due to Thurstone) which expresses


t*  ,  fttlJJIl 
1J 


(4.3.11) 


where  u.  “  2  z . .v  . /w .  /  E  zf ,/w.  ,  .  . 

J  iml  «  oi  y  !■!  iJ  i 


(4.3.12) 


are then elements of the new trial vector $\mathbf{t}_1$. We calculate

$$\mathbf{v}_1 = Z\mathbf{t}_1$$

and once again the weights are assigned following the scheme detailed
above. However, before assigning these new weights the BND value is
reduced to 3, in the next cycle to 2 and then kept at 1 for the remaining
iterations. Beginning with $\mathbf{t}_0$, seven iterations are performed
which give rise to $\mathbf{t}_1, \mathbf{t}_2, \ldots, \mathbf{t}_7$. The
last one, $\mathbf{t}_7$, is retained as $\mathbf{t}$ and
$Z\mathbf{t} = \mathbf{v}$ is formed. If the number of "zero" entries in
this $\mathbf{v}$ is significantly large as explained earlier, we will have
determined a well defined subspace, to which $\mathbf{t}$ is normal. The
entire process is repeated, with each row taken as a trial value. A given
row may or may not identify a subspace. If it leads to an overdetermined
subspace, the solution, except for the first one, is checked as explained
in section 4.2 to ensure that the subspace is "different" from any of the
ones already found from previous rows. For this purpose, we require that
the cosine of the angle between two planes should not exceed 0.7, meaning
that the planes are apart by at least 45°. Once again we would like to
mention that this is an arbitrary requirement; we found this useful in our
experimental studies.
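
The bound schedule and the "different subspace" check can be sketched in
Python as follows. This is a hedged illustration, not the CLUSTR code:
the names `BND_SCHEDULE` and `is_distinct` are hypothetical, the schedule
is one reading of the text (8 for the first set of weights, then 3, 2,
and 1 thereafter), and the use of the absolute value of the cosine, to
allow for the arbitrary sign of a normal vector, is an assumption.

```python
# One bound per iteration, t_1 through t_7 (a reading of the text).
BND_SCHEDULE = [8, 3, 2, 1, 1, 1, 1]

def is_distinct(t_new, accepted, max_cos=0.7):
    """Return True if the unit normal t_new is far enough from all
    previously accepted unit normals.

    Two planes are treated as the same solution when the cosine of the
    angle between their normals exceeds max_cos (0.7, roughly 45 degrees).
    """
    for t_old in accepted:
        cos_angle = sum(a * b for a, b in zip(t_new, t_old))
        if abs(cos_angle) > max_cos:   # assumption: the sign of a normal is arbitrary
            return False
    return True

accepted = [[1.0, 0.0, 0.0]]
keep_orthogonal = is_distinct([0.0, 1.0, 0.0], accepted)   # cosine 0
keep_close = is_distinct([0.8, 0.6, 0.0], accepted)        # cosine 0.8
```

The first solution found is always kept, since the list of accepted
normals is then empty.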

Now we must retransform to the original data frame. The vectors
$\mathbf{t}$ that we obtain above are with reference to the matrix Z. Our
original objective was to remove mean shifts and then look for
experimental units which lie on subspaces. However, in working with Z, we
have removed the mean shifts and also reduced the metric to an orthogonal
Cartesian frame. This was a matter of convenience. To restore the original
metric, we must obtain the solution in terms of Y. This can be easily
obtained as under. (4.3.6) is the transformation of Y into Z and (4.3.7)
is the simple structure solution in terms of Z. Substituting (4.3.6) into
(4.3.7) we obtain

$$Y(T')^{-1}\mathbf{t} = \mathbf{v} \qquad (4.3.13)$$

from which we conclude that $(T')^{-1}\mathbf{t}$ is the transformation in
terms of Y.
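
A small numerical check of (4.3.13) is sketched below in Python. It
assumes that T is lower triangular (for example a Cholesky factor), which
this section does not state; the function name `back_transform` and all
numbers are made up for illustration.

```python
def back_transform(T, t):
    """Return (T')^{-1} t for a lower-triangular matrix T.

    T' is upper triangular with T'[i][j] = T[j][i], so the system
    T' x = t is solved by back substitution from the last component.
    """
    p = len(t)
    x = [0.0] * p
    for i in reversed(range(p)):
        s = sum(T[j][i] * x[j] for j in range(i + 1, p))
        x[i] = (t[i] - s) / T[i][i]
    return x

T = [[2.0, 0.0], [1.0, 1.0]]      # hypothetical lower-triangular T
t = [0.6, 0.8]                    # unit vector found in the Z frame
z = [1.0, 2.0]                    # one row of Z
# The matching row of Y, from Y = Z T' (the inverse of Z = Y (T')^{-1}).
y = [sum(z[k] * T[j][k] for k in range(2)) for j in range(2)]

u = back_transform(T, t)                       # transformation in terms of Y
v_from_z = sum(a * b for a, b in zip(z, t))    # z' t
v_from_y = sum(a * b for a, b in zip(y, u))    # y' (T')^{-1} t
```

Both routes give the same score for the data point, which is what
(4.3.13) asserts.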


The computer program reports both $\mathbf{t}$ and $(T')^{-1}\mathbf{t}$.
The vector $\mathbf{t}$ is reported under the heading "Vector which
transforms original factor matrix into the above plane no. __" and
$(T')^{-1}\mathbf{t}$ is reported under the heading "Transformation vector
to transfer raw data to simple structure." The corresponding elements of
$\mathbf{v}$ are also reported. We shall also refer to elements of
$\mathbf{v}$ as the "loadings" or "scores" of original points with
reference to the simple structure plane determined.

For each $\mathbf{t}$, only the vectors $(T')^{-1}\mathbf{t}$ are conveyed
to the display program, since we plot the original data points in terms of
the observed variables. As will be explained in Chapter 5, the display
program takes one of the observed variables as one axis and some specified
vector as another axis. For a given choice of a variable and a vector, the
vector is
reduced  in  such  a  manner  that  it  forms  an  orthogonal  frame  of  reference  with 
the  chosen  observed  variable.  A  question  may  be  raised  as  to  why  one  of  the 
axes  always  corresponds  to  an  observed  variable.  In  this  connection,  it 
should  be  pointed  out  that  a  vector  specified  by  the  user,  is  equivalent 
to  an  artificial  variable,  a  linear  combination  of  the  observable  ones.  The 
elements  of  the  vector  given  by  the  user  serve  as  weights  in  forming  this 
linear  combination.  When  we  view  the  displays  with  reference  to  an  ortho¬ 
gonal  frame  of  reference  consisting  of  an  observable  variable  against  an 
artificial  variable,  we  can  make  an  inference  as  to  how  an  observable 
variable  compares  with  an  artificial  variable.  If  both  reference  axes  were 
to correspond to artificial variables, the displays would be hard to
interpret in a realistic sense. This is our main consideration in insisting
that one of the axes correspond to an observable variable. Further, if we
should permit a user to select any two vectors, these could be translated
into an orthogonal frame of reference in many different ways leading to
utter confusion.

4.4 Elimination of Variables

One  notable  difference  between  the  rotational  problem  in  factor 
analysis,  and  the  problem  presented  in  this  dissertation,  is  the  fact 
that the former focuses attention on those points (variables, in that
case)  which  are  far  removed  from  the  subspaces.  By  contrast,  the  latter 
pays  attention  to  only  those  points  which  are  so  close  to  a  subspace  that 
they  define  it.  It  is  these  data  points  only  which  require  one  variable 
less  for  adequate  description.  It  should  be  noted,  again,  that  such  a 
subspace is not a single region within the p-dimensional space. Rather,
for  each  simple  structure  plane,  there  are  as  many  (parallel)  (p-1)- 
dimensional  hyperplanes  as  there  are  clusters.  Points  which,  in  this 
sense,  fall  into  the  same  subspace  may  be  far  removed  from  each  other, 
inasmuch  as  they  may  be  in  different  clusters.  But  even  with  their 
distinct  neighbors,  they  share  the  property  that  the  same  (p-1)  variables 
are  sufficient  to  explain  their  characteristics.  It  is  for  this  reason 
that  we  expect  entirely  new  principles  of  classification  of  data  points, 
different  from  what  could  be  expected  by  varying  or  refining  cluster 
analysis  or  factor  analysis  techniques. 

Mathematically, we could use for description, (p-1) linear combinations
of the original p variables, with the combinations chosen within the
hyperplane orthogonal to the $\mathbf{t}$ vector. For real life
interpretation, this procedure would be useless. Surely we are better off
with p observable variables than with (p-1) artificial ones. The principle
of parsimony is not just a principle restricted to dimensionality. It would
seem reasonable, then, to eliminate, for the data points close to one of
the subspaces, that observable variable which contributes the least to
the description of this selection of data points.

As  a  measure  of  proximity  of  each  variable  to  the  artificial 
variable  which  defines  the  subspace,  we  propose  to  use  the  correlation 
between  the  original  variables  and  the  artificial  variable.  This  can  be 
obtained  as  follows.  Recall  that  Y  is  obtained  from  the  original  data 
matrix  after  subtraction  of  the  appropriate  cluster  means.  Hence 

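$$\mathbf{1}'Y = \mathbf{0}' \qquad (4.4.1)$$

where $\mathbf{1}'$ is a row vector consisting of all 1's and $\mathbf{0}'$
is a null row vector. Also,

$$\mathbf{1}'\mathbf{v} = \mathbf{1}'Z\mathbf{t} = \mathbf{1}'Y(T')^{-1}\mathbf{t} = 0 \quad \text{(by (4.3.6))} \qquad (4.4.2)$$

$$\mathbf{v}'\mathbf{v} = \mathbf{t}'Z'Z\mathbf{t} = n_e\,\mathbf{t}'I\mathbf{t} = n_e \qquad (4.4.3)$$

since $\mathbf{t}$ is a unit vector. By virtue of the fact that
$Z'Z = n_e I$, and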

because of (4.4.1), $(1/n_e)Y'\mathbf{v}$ will be an unbiased estimate of
the covariances between the original variables and an artificial one on
which the data points have scores which are the elements of $\mathbf{v}$.
Since $Y = ZT'$ by (4.3.6), these covariances are
$(1/n_e)Y'\mathbf{v} = (1/n_e)TZ'Z\mathbf{t} = T\mathbf{t}$. There will be
as many $\mathbf{v}$ vectors as there are overdetermined subspaces (each
corresponding to a different, but possibly overlapping, selection of data
points). For each of these, we must determine its correlations with the
original variables. Thus, if there are 4 variables and 3 different
solutions (overdetermined subspaces), we will have a total of 4 x 3 = 12
correlations. Now, for a given $\mathbf{v}$ (a given subspace),

$$(1/n_e)\,\mathbf{v}'\mathbf{v} = 1 \qquad (4.4.6)$$

since a sum of squares ($\mathbf{v}'\mathbf{v}$) must be divided by degrees
of freedom to produce a variance estimate. Hence to obtain the
correlations, we have to divide the elements of $T\mathbf{t}$ by the square
root of the estimate of the variance of the observable variable only. In
the computer program, after the vector $\mathbf{t}$ is obtained, the
product $T\mathbf{t}$ is formed and the correlations are then calculated by
division of each of the elements of $T\mathbf{t}$ by the square root of the
estimate of the variance of the corresponding observable variable. The
estimates of the variances of the observables are still available in the
final step of the cluster analysis part of the program and these values are
stored for use at this time. T is also stored.

Once the correlations between the original variables and their
scores with reference to a subspace (i.e., vector $\mathbf{v}$) are
available, it is easy to determine which variable should be discarded. In
discriminant analysis, we come across the concept of correlations between
original
variables and the discriminant function. The discriminant function is
nothing but a linear combination of the original variables which
discriminates best between experimental units in a specified sense. The
purpose for which we want to discriminate is important, as the discriminant
functions for different purposes are usually different. Given these things,
the variable which correlates most strongly in absolute value with the
discriminant function is the most important in discriminating. In our
study the vector $\mathbf{t}$ which transforms the original observations
into $\mathbf{v}$ plays a similar role, in the sense that a well defined
subspace represented by $\mathbf{v}$ has most elements near zero (see
section 4.3) and a few far removed from zero. In analogy with discriminant
analysis, $\mathbf{v}$ is the vector that produces the two groups of data
points. Thus, the variable which correlates most strongly with this
artificial variable contributes most to the discrimination process. It is
this maximally correlated observed variable which should be eliminated for,
after it has been discarded, the other variables contribute far less to the
discrimination between these two sets of data points. If only one variable
is to be sought which would explain the difference between those data
points which fall into the subspace and those that do not, it would be this
one which is closest (has highest correlation in absolute value) to the
expendable artificial variable. Note the exact opposite of this technique
to discriminant analysis, where we seek the best discriminator. Here we
identify the "most expendable" artificial variable. By our correlational
technique, we have identified, for each subspace, that observable variable
which is closest to the artificial variable not needed in the description
of the data subset. Note, again, the need for this correlational
interpretation. If it were argued that the artificial variable itself ought
to be discarded, we would be left with (p-1) artificial variables, a rather
unsatisfactory situation. After the correlational approach, we have (p-1)
observed variables left.
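
The correlation step and the choice of the variable to discard can be
sketched in Python as follows. The function names and the numbers are
hypothetical; the sketch assumes that the variance estimates are the
diagonal elements of S = TT', and it uses 0-based variable indices,
whereas the text numbers variables from 1.

```python
import math

def correlations_with_subspace(T, t, variances):
    """Correlations between each observed variable and the scores v.

    The covariances are the elements of T t; since (1/n_e) v'v = 1,
    only the standard deviation of the observed variable divides them.
    """
    Tt = [sum(T_jk * t_k for T_jk, t_k in zip(row, t)) for row in T]
    return [c / math.sqrt(s2) for c, s2 in zip(Tt, variances)]

def variable_to_eliminate(corr):
    """Index of the variable with the largest absolute correlation."""
    return max(range(len(corr)), key=lambda j: abs(corr[j]))

T = [[2.0, 0.0], [1.0, 1.0]]   # hypothetical factor, S = TT' = [[4, 2], [2, 2]]
variances = [4.0, 2.0]         # diagonal of S
t = [0.6, 0.8]                 # unit normal of one overdetermined subspace
corr = correlations_with_subspace(T, t, variances)
drop = variable_to_eliminate(corr)
```

In this made-up example the second variable has the larger correlation
and would be the one eliminated for the points on this subspace.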


CHAPTER  V 


DESCRIPTIONS OF COMPUTER PROGRAMS

5.1  The  Computer  Programs 

The  algorithms  explained  in  previous  chapters  have  been  synthesized 
into  two  computer  programs  which  are  available  at  the  University  of  Georgia. 
The first program, named CLUSTR, and described in the next section,
identifies the clusters and overdetermined subspaces. The second one is
available  as  a  conversational  system  for  the  IBM  2250  Graphics  unit.  In 
this  chapter,  we  describe  these  two  computer  programs.  The  following 
chapter  will  contain  instructions  regarding  the  use  of  these  programs,  and 
the  interpretation  of  graphical  displays. 

5.2  An  Algorithm  for  Identification  of 
Points  Lying  on  a  Subspace 

In  this  program,  beginning  with  the  data  matrix,  we  first  identify 
the clusters. This is essentially the algorithm proposed by Bargmann and
Graney [5]. However, the algorithm proposed by Bargmann and Graney stops
at the identification process. Since we also need to identify points
lying on an overdetermined subspace, we extend the algorithm further. A
complete listing of the program is contained in Appendix A. This program
can be logically divided into 2 parts. Up to statement number 811, it is
essentially the reproduction of the program developed by Bargmann and
Graney [5], where a complete documentation of this part can be found. We
have made a small change in the program to suit our needs. As explained
in Bargmann and Graney [5], their program makes three passes to identify
clusters. The passes made in their program are controlled by the statement
immediately following statement number 444. In our program, CLUSTR, this
has been replaced by the transition to the subspace-identification
program. Further, the program developed by Bargmann and Graney [5] does
not calculate cluster means or the "within" matrix after three passes.
Their program computes these quantities at the beginning of the first
pass, and after the first and second passes. We require these quantities
for our search for the points lying on subspaces. These quantities are
calculated in statements between number 811 and number 20001. At this
stage, we also punch cards containing the means for each cluster and each
variable. These will be needed prior to the execution of ELLIPSE described
below. In statements between number 11020 and number 10001, we standardize
the original points and reduce them to zero mean and a Cartesian metric.
This, then, is the beginning of the second logical part of the program.
Here, we determine the points lying on subspaces, using the method of
weighted least squares together with Thurstone's "Analytical Method" and
the "Single Plane Method." The algorithms (including the change of the BND
variable) have been presented in sections 4.2 and 4.3.

The  output  of  this  program  consists  of  two  parts — a  print  out  and 
a  punched  deck.  The  printed  output  contains  the  results  of  three  passes 
made to find clusters. The results listed after the third pass are the
final results relating to cluster analysis. They show which point belongs
to which cluster, and at what level each point was included in the
cluster. The second part of the printed output gives various simple
structure solutions. It gives vectors which transform the original
observations into simple structure planes. For each of these vectors, the
correlations between the original variables and the "scores" of points
with reference to this vector, the number of points falling into the
simple structure plane corresponding to this vector, and the probability
of this many points falling into this simple structure plane, are given.
A simple structure plane is not included as a solution if the probability
of the number of points falling into this plane is greater than 0.01 or if
it is within 45° of a plane already found.

Apart from the printed output, a punched deck is also produced.
These are data cards which are later loaded into data sets, as described
below. The first few cards in this deck, equal in number to the number of
clusters formed, are the cards containing cluster means. The next set of
cards contains vectors which transform the original data points into
simple structure solutions. The number of clusters is designated by NG
and the number of simple structure solutions is designated by NSOL. This
program also reproduces the cards for each data point with the following
additional information. For each data point, the card contains a serial
number in columns 1-3, the number of the cluster to which it belongs in
column 4, and the simple structure planes in which it is included in
columns 5-14. In column 4, a '1' is punched if the point belongs to
cluster number 1, '2' if the point belongs to cluster number 2, etc. '0'
is punched in column 4 if the point was not assignable to any of the
clusters. The simple structure planes to which the point belongs are
indicated in columns 5-14 as follows: A '1' in column 5 indicates that the
point falls into simple structure number 1, a '1' in column 6 indicates
that the point belongs to simple structure number 2, etc. (with provision
for up to 10 solutions). This is certainly more than adequate capability.
Zeros or blanks in columns 5 to 14 (the computer program punches zeros)
indicate absence of this point in the corresponding simple structure
plane. Columns 15-70 contain the original coordinates of each data point.
The entire punched deck output is also printed out. The cluster means are
printed at the end of the third pass, and the vectors which transform the
raw data into simple structure solutions are printed immediately after the
printout of the corresponding solution. The information punched into the
last NF cards of the punched deck is also printed as a final summary. The
user may find it helpful to keep this summary with him while he studies
the displays.
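
The layout of a data-point card can be illustrated with a small Python
reader. This is a sketch only: the function name `parse_point_card` and
the sample card are made up, a blank in column 4 is assumed to mean "not
assigned", and the coordinate field (columns 15-70) is returned as raw
text because its internal format is not given in this section.

```python
def parse_point_card(card):
    """Split one 70-column data-point card into its fields.

    Returns (serial, cluster, planes, coordinate_field), where planes
    lists the simple structure solutions (1 to 10) containing the point.
    """
    card = card.ljust(70)
    serial = int(card[0:3])                            # columns 1-3
    cluster = int(card[3]) if card[3] != " " else 0    # column 4, 0 = unassigned
    flags = card[4:14]                                 # columns 5-14
    planes = [k + 1 for k, ch in enumerate(flags) if ch == "1"]
    coordinate_field = card[14:70]                     # columns 15-70, raw text
    return serial, cluster, planes, coordinate_field

# Hypothetical card: point 12, cluster 2, on simple structures 1 and 3.
card = " 12" + "2" + "1010000000" + "    1.50   -0.25"
serial, cluster, planes, field = parse_point_card(card)
```

Zeros and blanks in columns 5-14 are treated alike, as the text allows.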

5.3  Loading  the  Data  Set  (Utility 
Program  IEBGENER) 

After the user has subjected his data to the CLUSTR program, he
should next run the IEBGENER program. This program supplies the output
of the CLUSTR program as input to the ELLIPSE program. A header card,
containing the number of points, the number of variables, the number of
groups, and the number of overdetermined subspaces identified by the
CLUSTR program, is put before the punched deck produced by the CLUSTR
program, and the IEBGENER program is executed. A sample deck set-up for
the IEBGENER program is given in Chapter VI; this is a utility routine
which transfers the cards to disk.

5.4  The  Display  Program  and  the 
Conversational  System 

The  second  program  of  the  package  serves  to  display  projections 
of  the  clusters  and  subspaces,  on  an  IBM  2250  Graphics  Display  Unit.  It 
enables  a  user,  at  the  console,  to  communicate  with  the  system  and  to 
manipulate  displays  appearing  on  the  scope.  The  program,  named  ELLIPSE, 
is capable of handling up to 8 variables, 50 data points, 10 simple
structure solutions and 5 groups or clusters. The numbering of the
variables is implicit. The first coordinate, for each data point, is
regarded as variable number 1, the second coordinate as variable number
2, etc.

The program is an interactive one in the sense that the user, at
the console, can decide on variations for later displays on the basis of
what he saw in the earlier ones. The input of the data is so formatted
and programmed that, if a user wishes to reassign a point from one
cluster to another, or if he wishes to change cluster centers, etc., he
only needs to make a change to this effect in the corresponding data
cards.  This  capability  gives  the  user  an  opportunity  to  redefine  clusters 
and  subspaces  on  the  basis  of  the  displays  generated.  The  interaction 
between  the  user  and  the  system  is  achieved  through  the  use  of  the  pro¬ 
grammed  function  keys  and  the  alphameric  keyboard  which  is  a  part  of  the 
2250  Graphics  Display  Unit. 

The program includes a main program and 8 subroutines. The main
program calls two subroutines, CALC and EXIBIT. CALC reads the entire
input into the ELLIPSE program. The input consists of a header card
containing the number of points, the number of variables, the number of
clusters and the number of simple structure solutions; cluster means for
each variable; vectors which transform the original data points (raw data)
into various simple structure solutions; and the data points, together
with information regarding the cluster to which a data point belongs and
whether or not it lies on a given overdetermined subspace. As explained
earlier, this input is given to the program through the execution of the
IEBGENER utility routine. The first (executable) statement of the CALC
subroutine reads the header card. Following this, the DO 14 loop reads
cluster means, the DO 114 loop reads the transformation vectors and the
DO 10 loop reads the data points. The array IG is used to store
information regarding the cluster to which a data point belongs. Recall
that for each data point, columns 5-14 contained '1' to indicate the
corresponding simple structure plane to which this point belongs. This
configuration of '1's and '0's is read as a 10 digit integer number and
stored in the first column of the two-dimensional array INNSOL. The serial
number of a data point is stored in the second column of INNSOL.

In the DO 422 loop, the array IG is examined to determine the size
of each cluster. The DO 429 loop, then, determines the total number of
points assigned to clusters. The subroutine COR2 is now called, which
yields the within sum of squares and products matrix based on all the
points belonging to clusters. The upper triangular part of this symmetric
matrix is stored, columnwise, in the array DSPROD. In the DO 434 loop,
each element of the array DSPROD is divided by $n_e$, the degrees of
freedom, to yield an estimate S of the common variance-covariance matrix,
$\Sigma$. Immediately following the statement number 434, SINV, a
subroutine from the IBM Scientific Subroutine Package, is called to invert
the matrix S. The inverse of this matrix S, stored as the first column of
the two-dimensional array A, is the metric (based on all points belonging
to clusters) employed for unit ellipsoids around the clusters.

The subroutine CALC then calculates metrics corresponding to
overdetermined subspaces. The DO 426 loop calculates, for each
overdetermined subspace, a metric based on the points belonging to it.
The inner DO 427 loop examines the 10 digit integer numbers stored in the
first column of the array INNSOL to determine the points lying on a
particular subspace. For each subspace, the subroutine COR2 is utilized to
calculate a new within-cluster sum of squares and products matrix. In the
DO 442 loop, each element of this matrix is divided by the number of
points belonging to the overdetermined subspace, to yield an estimate of
the variance-covariance matrix based on the points belonging to the
subspace only. The subroutine SINV is then called to invert this matrix.
The inverse matrix is used as the metric for ellipsoids corresponding to
the overdetermined subspace. The metric corresponding to the
overdetermined subspace number 1 is stored as the second column of the
two-dimensional array A, the metric corresponding to the overdetermined
subspace number 2 is stored as the third column of the array A, etc. This
is accomplished in the DO 433 loop. The DO 15 loop, beginning at statement
number 450, calculates the maximum and minimum values for each variable.
These maximum and minimum values are required later for scaling purposes
to accommodate each of the projected data points within the screen limits.
A flow chart for the subroutine CALC is given in figure 5.4.1.
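
Two of the storage conventions used by CALC can be illustrated in
Python. The sketch is not taken from the ELLIPSE listing: the function
names are hypothetical, and it assumes that column 5 of the card becomes
the leftmost digit of the 10 digit integer, which is what reading columns
5-14 as a single integer field implies.

```python
def on_subspace(code, k, width=10):
    """True if digit k (1-based, counted from the left) of the
    10 digit membership code is 1, i.e. the point lies on solution k."""
    return (code // 10 ** (width - k)) % 10 == 1

def packed_index(i, j):
    """0-based position of element (i, j) of a symmetric matrix whose
    upper triangle is stored column by column; i and j are 1-based."""
    if i > j:
        i, j = j, i               # use symmetry to reach the upper triangle
    return j * (j - 1) // 2 + i - 1

code = 1010000000                 # point lies on solutions 1 and 3
members = [k for k in range(1, 11) if on_subspace(code, k)]

# Upper triangle of a 3 x 3 symmetric matrix, stored columnwise:
# (1,1), (1,2), (2,2), (1,3), (2,3), (3,3)
packed = [4.0, 2.0, 2.0, 1.0, 0.5, 3.0]
s_23 = packed[packed_index(2, 3)]
s_32 = packed[packed_index(3, 2)]
```

The largest possible code, 1111111111, is below 2147483647, so it fits in
a 32-bit signed integer.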

Subroutine  EXIBIT 

The  subroutine  EXIBIT  begins  with  a  call  to  the  DISPLA  subroutine 
of  the  GRAF  (Graphics  Addition  To  Fortran)  package  [16].  This  results  in 
setting  up  GDSX,  GDSY,  GTEXT,  GPOIN'T,  GDSE,  GDSER.  and  GINPUT  as  display 
variables.  The  subroutine  LIGHTS  of  the  GRAF  package  is  next  called  to 
turn  on  the  lights  corresponding  to  the  programmed  function  keys  numbered 
1  up  to  the  number  of  variables,  keys  27  to  29  and  31.  After  these  pre¬ 
liminaries,  the  subroutine  MESSGF,  is  called  to  display  an  informative 
message  about  the  program,  for  the  benefit  of  the  user.  The  subroutine 

i 

MESSGE  returns  the  control  to  the  subroutine  EXIBIT  as  soon  as  the  user 
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presses  any  programmed  function  key.  The  message  appearing  on  the 
screen  is  erased,  a  variable  NTEST  (used  later  to  register  tne  depressed 
key)  is  sec  initially  equal  to  1  and  a  variable  NGG  (a, flag  which  la  used 
to  indicate  change  of  vectors)  is  set  equal  to  0  and  the  control  goes  to 
statement  number  20.  This  statement  is  a  call  to  the  subroutine  KEY IN, 
with  NTEST,  the  input  argument  having  been  set  equal  to  1.  The  subroutine 
KEVIN  accepts  from  the  user  a  vector  and  the  number  of  a  variable,  in 
order  to  form  a  2-dimensional  plane.  The  user  Indicates  his  choice  of 
the  number  of  the  variable  by  pressing  the  corresponding  programmed 
function  key.  The  variable  NAXIS  is  used  as  an  output  argument  of  the 
subroutine  KEVIN  and  on  return  contains  the  number  of  the  programmed 
function  key  pressed  by  the  user.  Aa  will  be  seen  later,  on  return  from 
KEVIN,  the  variable  NAXIS  must  either  have  a  value  equal  to  the  number  of 
the  variable  the  user  wishes  to  utilize,  or  30.  If  NAXIS  equals  30,  it  is 
implied  that  the  user  wishes  to  stop  and  the  display  program  comes  to  an 
end.  Otherwise,  the  subroutine  ELLFSE  Is  called  with  the  current  value 
of  NAXIS  (the  number  of  the  variable)  as  input  argument.  The  subroutine 
ELLFSE  displays  the  projections  of  the  original  data  points,  and  the  unit 
ellipsoids  having  the  metric  based  on  all  the  points  belonging  to  clusters, 
onto  the  2-dimensional  plane  formed  by  the  vector  and  the  variable 
supplied  by  the  user,  provided  there  is  no  singularity  or  redundancy  in¬ 
volved  (see  ELLFSE  below).  After  the  above  displays  appear,  the  user  is 
expected to press a programmed function key. If he presses key 29 or 31,
the control comes to statement number 80. This results in erasing the
current displays and then a call to the subroutine KEYIN, with NTEST,
the input argument, being 29 or 31. As before, the call to the subroutine
KEYIN is followed by a call to the subroutine ELLPSE and the cycle
continues. If a key corresponding to the number of an overdetermined
subspace was pressed, the control comes to statement number 61, and if
key 30 was pressed, the display program comes to an end. If the control
comes to statement number 61, the subroutine RELPSE is called to display
the projections of the overdetermined subspace. If none of the
above-mentioned keys is pressed, an error message, as contained in the
format statement number 62, appears on the screen. The error message
continues to appear until the user presses a proper key or, in case of
singular or redundant situations, rectifies the situation. If the
subroutine RELPSE is called, after the projections of the overdetermined
subspace are displayed, the user is expected to press a programmed
function key. Once again, if key 29 or 31 is pressed, the current displays
are erased, the control comes to statement number 20 and the subroutine
KEYIN is called. The program terminates if key 30 was pressed. If the key
corresponding to the number of an overdetermined subspace was pressed,
that part of the current display pertaining to the projection of the
overdetermined subspace is erased, and the subroutine RELPSE is called to
display the projections of the overdetermined subspace that is now being
requested. An error message appears if none of the above-mentioned keys
is pressed, and the cycle continues in this manner. A flowchart of the
subroutine is given in figure 5.4.2.
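
The key handling described above can be summarized as a small dispatch function. The following Python sketch is an illustration only and is not taken from the FORTRAN source: the function name `dispatch_key`, the parameter `n_subspaces` and the returned action labels are assumptions introduced here. Only the key numbers 29, 30 and 31, and the use of the subspace number as a key, come from the description.

```python
def dispatch_key(key, n_subspaces):
    """Return the action taken after a display, for a programmed function key.

    Hypothetical helper: the action names are labels for this sketch only.
    """
    if key in (29, 31):
        # 29 = supply a new vector, 31 = choose a new variable;
        # in both cases the display is erased and KEYIN is called.
        return "KEYIN"
    if key == 30:
        # key 30 ends the display program
        return "STOP"
    if 1 <= key <= n_subspaces:
        # superimpose the overdetermined subspace with this number
        return "RELPSE"
    # any other key produces the error message and the user must try again
    return "ERROR"
```

With three overdetermined subspaces, for example, `dispatch_key(2, 3)` selects the RELPSE branch, while `dispatch_key(7, 3)` falls through to the error message.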

Figure 5.4.2a

Figure 5.4.2b

Subroutine KEYIN

The subroutine KEYIN accepts a vector and the number of the
variable from the user, to form a 2-dimensional plane. It has one input
argument, NTEST, and one output argument, NAXIS. The first statement in
the subroutine tests the value of NTEST. If it equals 31, that part of
the subroutine which accepts a vector from the user is skipped, and the
control comes to statement number 101. Otherwise, the control comes to
statement number 99, and whatever appears on the screen is erased to
prepare to accept a vector from the user. On the first call to the
subroutine KEYIN, the input argument, NTEST, is set equal to 1 so that the
control invariably comes to statement 99. On subsequent occasions,
however, NTEST will have a value of 29 or 31. The DO 110 loop displays the
transformation vectors suggested by the rotation (CLUSTR) program for the
information of the user. The message contained in the format statement
number 3000, requesting the user to supply a vector, then appears on the
screen. The program now waits for the user to supply the vector. The
calls to the SCTDV and DVTDM subroutines of the GRAF package transfer the
vector supplied by the user from the screen to the display variable table
and from there to the dummy unit 4. The vector is read from the dummy
unit 4 into the array RINPUT. Before the vector is read, however, the
DO 211 loop transfers the current values stored in the RINPUT array to the
RWRKNG array. This is done to ensure that the vector previously supplied
by the user is not destroyed in case he wants to continue with that
vector. The DO 213 loop checks whether the user supplied a null vector.
If so (as is the case when he just wants to continue with the previous
vector), it is always replaced by the previous non-null vector, as is
apparent from the DO 216 loop. The only exception, as will be seen from
the statement following the statement number 213, is the first call to
the subroutine. A null vector on the first call would result in
singularity, and the user will receive an error message instead of the
displays.

After the vector is accepted, the control comes to statement
number 101. The message contained in the format statement number 1658
appears on the screen. The loop beginning with the statement number 60
ensures that no value other than 30, or the number corresponding to the
variable which the user wishes to utilize, will be returned as the value
of the output argument NAXIS. The user can, of course, go back to
statement 99, the beginning of the program, if he presses key 28. This
gives him a chance to amend the vector already supplied. A flowchart of
the subroutine is given in figure 5.4.3.
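
The treatment of a null vector in KEYIN can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the original code: the function name `accept_vector` and its arguments are hypothetical, with `previous` playing the role of the copy saved in the RWRKNG array.

```python
def accept_vector(typed, previous, first_call):
    """Return the vector KEYIN would continue with.

    typed      -- coordinates just entered by the user
    previous   -- vector saved from the preceding call (role of RWRKNG)
    first_call -- True on the first call, when no earlier vector exists
    """
    is_null = all(c == 0.0 for c in typed)
    if is_null and not first_call:
        # The user pressed EOB without typing anything:
        # continue with the earlier, non-null vector.
        return list(previous)
    # On the first call a null vector is passed on unchanged; the display
    # routine later detects the resulting singularity and reports an error.
    return list(typed)
```

The design choice worth noting is that the fallback is silent: a null entry is read as "no change", not as an error, except on the first call.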


Subroutine  ELLPSE 

This subroutine is used to display projections of original data
points, and of the unit ellipsoids having the metric based on all points
belonging to clusters, onto the 2-dimensional plane formed by the vector
and the variable supplied by the user. The DO 10 and DO 11 loops set up a
matrix R formally given by

$$R = \begin{pmatrix} 0 & r_1 \\ \vdots & \vdots \\ 1 & 0 \\ \vdots & \vdots \\ 0 & r_p \end{pmatrix} \qquad (5.4.1)$$

where 1 in the first column appears in the row corresponding to the number
of the variable chosen by the user, and the second column, with elements
$r_1, \ldots, r_p$, contains the vector supplied by the user, made
orthogonal with the desired variable. If the user supplies a vector
collinear with the desired variable, the second column of the matrix R
will have all zero elements. The DO 20 loop, and the statement
immediately following this loop, check for the above-mentioned possibility
of singularity. If no singularity is present, the control comes to
statement number 211; otherwise, the error message, as contained in the
format statement number 1659, is displayed on the screen, and control is
returned to the subroutine EXIBIT. At statement 211, the DO 21 loop is
set up to normalize the second column of the matrix R. The DO 12 loop
picks up the metric $S^{-1}$ based on all points belonging to clusters, and
the subroutines MPRD and GTPRD of the IBM Scientific Subroutine Package
are called to form the matrix product $R'S^{-1}R$. The DO 13 loop, and the
statement immediately following it, calculate the matrix product XR, where
X is the matrix of original data points. For a projected data point, the
element of the first column of the matrix is treated as its x-coordinate
and the element of the second column is treated as its y-coordinate. The
DO 110 loop creates orders to plot points with these sets of x and y
coordinates. The actual plotting is done by displaying, on the screen, at
the place where the point should appear, its serial number, so that the
user may know which data point projects into what region of the
2-dimensional display.

In the statements immediately following the DO 110 loop, the
semi-axes of the ellipses to be displayed are calculated. They are based
on the metric $R'S^{-1}R$ (of the projected ellipsoids). The cluster
centers $\mu_i$ (centers of ellipses) are likewise transformed into
$\mu_i'R$. The points describing the circumference of the ellipses

$$(x' - \mu_i'R)\, R'S^{-1}R\, (x - R'\mu_i) = 1 \qquad (5.4.2)$$

where $x' = (x_1, x_2)$ are the running coordinates, are constructed by
angular sweep. Beginning with an initial angle of 5°, the DO 150 loop
calculates 72 different points describing the circumference of the
ellipses. The DO 100 loop creates orders to plot these points, and the
ellipses appear on the screen when the statement immediately following
the DO 100 loop is executed. A flowchart of the subroutine is given in
figure 5.4.4.
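
The steps carried out by ELLPSE can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the original FORTRAN. The names `build_r`, `project` and `ellipse_points` are hypothetical. The second column of R is assumed to be the user's vector with its component along the chosen variable set to zero and then normalized, which agrees with the normalized vectors quoted in section 6.5. The 5° step of the sweep is inferred from 72 points covering a full turn.

```python
import math

def build_r(vector, variable):
    """Return the two columns of R for a user vector and a 1-based variable."""
    k = variable - 1
    col1 = [1.0 if i == k else 0.0 for i in range(len(vector))]
    # Assumption: the component along the chosen variable is dropped.
    col2 = [0.0 if i == k else float(v) for i, v in enumerate(vector)]
    norm = math.sqrt(sum(c * c for c in col2))
    if norm == 0.0:
        # the vector is collinear with the variable axis (singular case)
        raise ValueError("vector collinear with the chosen variable")
    return col1, [c / norm for c in col2]

def project(point, col1, col2):
    """One row of XR: the (x, y) screen coordinates of a data point."""
    x = sum(a * b for a, b in zip(point, col1))
    y = sum(a * b for a, b in zip(point, col2))
    return x, y

def ellipse_points(metric, center, start_deg=5.0, n_points=72):
    """Points with (x - c)' M (x - c) = 1, constructed by angular sweep."""
    (a, b), (_, d) = metric          # symmetric 2 x 2 matrix [[a, b], [b, d]]
    step = 360.0 / n_points
    points = []
    for i in range(n_points):
        t = math.radians(start_deg + i * step)
        c, s = math.cos(t), math.sin(t)
        # distance from the center along direction (c, s) to the ellipse
        radius = 1.0 / math.sqrt(a * c * c + 2.0 * b * c * s + d * s * s)
        points.append((center[0] + radius * c, center[1] + radius * s))
    return points
```

Here `metric` stands for the 2 x 2 product $R'S^{-1}R$ and `center` for the projected cluster center; both would be computed beforehand from the cluster statistics.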

Subroutine  RELPSE 

This subroutine is used to display projections of ellipsoids
having metric based on the points belonging to the overdetermined subspace
being superimposed. If the metric for the ith subspace is denoted by
$S_i^{-1}$, the subroutine calculates the points describing the
circumference of ellipses

$$(x' - \mu_j'R)\, R'S_i^{-1}R\, (x - R'\mu_j) = 1 \qquad (5.4.3)$$

where R is as defined in ELLPSE. The technique employed to calculate these
points is similar to the one employed before. Like ELLPSE, RELPSE also
checks for singularity. If it is present, no displays of ellipses appear.
It is to be noted that, if the user chooses one of the vectors suggested
to him in the display (the vectors identified in the CLUSTR program), the
ellipses constructed in RELPSE, if the user depresses the corresponding
key, are quite flat, as intended. It is conceivable that, in such an
instance, singularities could occur (though we did not see any in our
examples) and hence the program checks for them.

Overlay Structure

To reduce storage requirements, an overlay structure was
designed as follows:

                    1
            +-------+-------+
            2               5
          +-+-+       +---+-+-+---+
          3   4       6   7   8   9

1.  (Root)  contains  the  main  program  and  Function  KD. 

2.  Contains  subroutine  CALC 

3.  Contains  subroutine  COR1S

4.  Contains  subroutine  COR2

5.  Contains  subroutine  EXIBIT 

6.  Contains  subroutine  MESSGE 

7.  Contains  subroutine  KEYIN 

8.  Contains  subroutine  RELPSE 

9.  Contains  subroutine  ELLPSE 

Segment 1, along with the main program and function KD, also
contains the system support routines IBCOM#, ARITH#, FIOCS#, ADCON#, and
system utilities IHCUATBL, IHCUOPT and IHCTRCH. Segment 5 contains all
of the GRAF routines required except BUFRS, CUR$$, RCUR$, READSC, SCNDVDK
and SCTDV, which are included in segment 7. With the help of this overlay
structure and equivalencing of a few arrays in the ELLPSE subroutine, it
was possible to reduce the storage requirements to 64K bytes.

Deck  Layout  for  ELLIPSE 

//STEP1 EXEC FORTGC
//FORT.SYSLIN DD DSNAME=&&CHAIN(ROOT),SPACE=(TRK,(150,10,5)),          C
//             UNIT=SYSDA,DISP=(NEW,PASS)
//FORT.SYSIN DD *

Main program here
/*

//STEP5 EXEC FORTGC
//FORT.SYSLIN DD DSNAME=&&CHAIN(LINKA3),DISP=(MOD,PASS),UNIT=SYSDA
//FORT.SYSIN DD *

Function KD Source Deck
/*

//STEP2 EXEC FORTGC
//FORT.SYSLIN DD DSNAME=&&CHAIN(LINKA),DISP=(MOD,PASS),UNIT=SYSDA
//FORT.SYSIN DD *

Subroutine CALC Source Deck
/*

//STEP7 EXEC FORTGC
//FORT.SYSLIN DD DSNAME=&&CHAIN(LINKA5),DISP=(MOD,PASS),UNIT=SYSDA
//FORT.SYSIN DD *

Subroutine COR1S Source Deck
/*

//STEP6 EXEC FORTGC
//FORT.SYSLIN DD DSNAME=&&CHAIN(LINKA4),DISP=(MOD,PASS),UNIT=SYSDA
//FORT.SYSIN DD *

Subroutine COR2 Source Deck
/*

//STEP4 EXEC FORTGC
//FORT.SYSLIN DD DSNAME=&&CHAIN(LINKA2),DISP=(MOD,PASS),UNIT=SYSDA
//FORT.SYSIN DD *

Subroutine EXIBIT Source Deck
/*

//STEP9 EXEC FORTGC
//FORT.SYSLIN DD DSNAME=&&CHAIN(LINKA8),DISP=(MOD,PASS),UNIT=SYSDA
//FORT.SYSIN DD *

Subroutine MESSGE Source Deck
/*

//STEP10 EXEC FORTGC
//FORT.SYSLIN DD DSNAME=&&CHAIN(LINKA9),DISP=(MOD,PASS),UNIT=SYSDA
//FORT.SYSIN DD *

Subroutine KEYIN Source Deck
/*

//STEP3 EXEC FORTGC
//FORT.SYSLIN DD DSNAME=&&CHAIN(LINKA1),DISP=(MOD,PASS),UNIT=SYSDA
//FORT.SYSIN DD *

Subroutine ELLPSE Source Deck
/*

//STEP8 EXEC FORTGC
//FORT.SYSLIN DD DSNAME=&&CHAIN(LINKA6),DISP=(MOD,PASS),UNIT=SYSDA
//FORT.SYSIN DD *

Subroutine RELPSE Source Deck
/*

//S10 EXEC LKED,PARM='(LET,LIST,OVLY,XREF)'
//LKED.SYSLMOD DD DSN=SYS1.GRAPHLIB(ELLIPSE),DISP=SHR,
//             SPACE=(TRK,(0,0))
//LKED.SYSLIB DD DSN=SYS1.GRAFLIB,DISP=SHR
//            DD DSN=SYS1.UGALIB,DISP=SHR
//            DD DSN=SYS1.SSPLIB,DISP=SHR
//            DD DSN=SYS1.FORTLIB,DISP=SHR
//            DD DSN=SYS1.LINKLIB,DISP=SHR
//            DD DSN=SYS1.GRAPHLIB,DISP=SHR
//LKED.MODULE DD DSN=&&CHAIN,DISP=OLD
//LKED.SYSIN DD *
 INCLUDE MODULE(ROOT)
 INCLUDE MODULE(LINKA3)
 INCLUDE SYSLIB(IBCOM#)
 INCLUDE SYSLIB(ARITH#)
 INCLUDE SYSLIB(FIOCS#)
 INCLUDE SYSLIB(ADCON#)
 INCLUDE SYSLIB(IHCUATBL)
 INCLUDE SYSLIB(IHCUOPT)
 INCLUDE SYSLIB(ERRMON)
 INCLUDE SYSLIB(IHCTRCH)
 OVERLAY ONE
 INCLUDE MODULE(LINKA)
 INCLUDE SYSLIB(SINV)
 INCLUDE SYSLIB(MFSD)
 OVERLAY TWO
 INCLUDE MODULE(LINKA5)
 OVERLAY TWO
 INCLUDE MODULE(LINKA4)
 OVERLAY ONE
 INCLUDE MODULE(LINKA2)
 INCLUDE SYSLIB(GAFERR)
 INCLUDE SYSLIB(LIGHTS)
 INCLUDE SYSLIB(WRFMT$)
 INCLUDE SYSLIB(DETEKT)
 INCLUDE SYSLIB(PLOT)
 INCLUDE SYSLIB(DETAIN)
 INCLUDE SYSLIB(DISPLA)
 INCLUDE SYSLIB($$OVER)
 INCLUDE SYSLIB(CHAR)
 INCLUDE SYSLIB(POINT)
 INCLUDE SYSLIB(LINE)
 INCLUDE SYSLIB(PLACE)
 INCLUDE SYSLIB($$$$BT)
 INCLUDE SYSLIB($$INIT)
 INCLUDE SYSLIB(DUMMY$)
 INCLUDE SYSLIB($VOVER)
 INCLUDE SYSLIB(CLOSE)
 INCLUDE SYSLIB(LINE$$)
 INCLUDE SYSLIB(UNPLOT)
 INCLUDE SYSLIB(PLACE$)
 INCLUDE SYSLIB(POINT$)
 INCLUDE SYSLIB(ERASE)
 INCLUDE SYSLIB(BLANK)
 INCLUDE SYSLIB(RPCET)
 OVERLAY TWOA
 INCLUDE MODULE(LINKA8)
 OVERLAY TWOA
 INCLUDE MODULE(LINKA9)
 INCLUDE SYSLIB(SCNDVDK)
 INCLUDE SYSLIB(CUR$$)
 INCLUDE SYSLIB(BUFRS)
 INCLUDE SYSLIB(SCTDV)
 INCLUDE SYSLIB(RCUR$)
 INCLUDE SYSLIB(READSC)
 OVERLAY TWOA
 INCLUDE MODULE(LINKA6)
 OVERLAY TWOA
 INCLUDE MODULE(LINKA1)


CHAPTER VI


USER’S  GUIDE 

6.1  Introduction 

The user who is interested in using the programs described in
the previous chapter will find himself in one of the following situations:

(i) He may not have analysed his multivariate data and may not yet have
identified clusters and subspaces. If so, he should first subject his
data to the CLUSTR program. The next section contains instructions on
how to use (execute) this program.

(ii) He may have analysed his data using the CLUSTR program but has not
loaded the data set for the ELLIPSE program. If so, he should execute the
IEBGENER Utility routine and load the data set. This is described in
section 6.3.

(iii)  Finally,  the  user  may  have  gone  through  the  steps  (i)  and  (ii)  above 
and  may  want  to  see  the  displays  of  projected  clusters  and  subspaces.  The 
use  of  the  ELLIPSE  program  including  an  indication  of  what  to  look  for  in 
the  displays  is  described  in  section  6.4. 

Section  6.5  contains  an  illustration. 

6.2 The CLUSTR program

The analysis of the user's multivariate data begins with the
identification of clusters and overdetermined subspaces, if any. For this
purpose, the user must first analyse his data using the CLUSTR program.
This is a batch program. The following data cards need to be
supplied:


Data  Card  1 

Columns  1-3,  number  of  points  (individuals  or  experimental  units) 

Columns  6-7,  number  of  variables  (responses)  measured  on  each 
experimental  unit 

Columns  8-11,  alpha  level  for  cluster  core  on  first  pass 
(suggested  value  0.90) 

Columns  13-16,  alpha  level  for  cluster  extension  on  first  pass 
(suggested  value  0.50) 

Columns  18-21,  alpha  level  for  cluster  core  on  second  pass 
(suggested  value  0.90) 

Columns  23-26,  alpha  level  for  cluster  extension  on  second  pass 
(suggested  value  0.50) 

Columns  28-31,  alpha  level  for  cluster  core  on  third  pass 
(suggested  value  0.90) 

Columns  33-36,  alpha  level  for  cluster  extension  on  third  pass 
(suggested  value  0.50)


Data  Card  2 

This  is  a  variable  format  card  and  should  contain  the  FORMAT  by 
which  each  experimental  unit  is  to  be  read. 

The remaining data cards contain the observations; one card (or record,
which may consist of several cards) contains the coordinates of one point.
The numbering of these points is implicit, according to the sequential
order of these cards.
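
The column layout of Data Card 1 can be made concrete with a short parsing sketch. The Python below is an illustration only, not part of the CLUSTR program: the function name `parse_card1` and the dictionary keys are assumptions introduced here, and only the column positions are taken from the description above.

```python
def parse_card1(card):
    """Split Data Card 1 into its fields, using the stated column ranges."""
    card = card.ljust(80)                 # a punched card has 80 columns
    n_points = int(card[0:3])             # columns 1-3
    n_vars = int(card[5:7])               # columns 6-7
    # First columns of the six alpha fields, each four columns wide.
    starts = (8, 13, 18, 23, 28, 33)
    alphas = [float(card[s - 1:s + 3]) for s in starts]
    # Pair the values as (core, extension) for passes 1, 2 and 3.
    passes = [(alphas[i], alphas[i + 1]) for i in range(0, 6, 2)]
    return {"points": n_points, "variables": n_vars, "passes": passes}
```

For a card carrying 45 points, 4 variables and the suggested alpha levels, every pass is returned as the pair (0.90, 0.50).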

Our suggestion above that 0.90 should be used as alpha level for
cluster core and 0.50 as alpha level for cluster extension is empirical.
Of course, the user can use any other set of values. For a detailed
discussion of this matter the reader is directed to Craney [10].

The output of this program has been discussed at length in
section *».2. The punched deck produced by this program is required for
the ELLIPSE program.

6.3  IEBGENER  Utility  routine

As  explained  in  section  5.3,  it  is  necessary  to  execute  the 
IEBGENER  Utility  routine  to  load  the  output  of  the  CLUSTR  program  into  a 
data  set  required  as  input  to  the  ELLIPSE  program.  The  deck  set-up  for 
the  execution  of  this  utility  routine  is  as  follows: 

(i)  JOB  card 

(ii)  //STEPG  EXEC  PGM=IEBGENER

(iii)  //SYSPRINT  DD  SYSOUT=A

(iv)  //SYSIN  DD  DUMMY

(v)  //SYSUT2  DD  DSN=SYS1.R2250,VOL=SER=UGA231,DISP=SHR,UNIT=2314
(vi)  //SYSUT1  DD  DATA,DCB=(RECFM=FB,LRECL=80,BLKSIZE=320)

(vii)  Data  Cards 
(viii)  /* 

The data cards consist of a header card followed by the punched deck
produced by the CLUSTR program. (Of course, the user could produce his
own data cards, with any assignment of points to clusters or subspaces
which he desires. In this respect the CLUSTR program is merely intended
to give him guidance, though very strong guidance indeed.) The header
card is made up as follows (all numbers right justified):

Columns 1-4, number of points (individuals or experimental units)

Columns 5-8, number of variables (responses) measured on each
experimental unit

Columns 9-12, number of clusters identified by the CLUSTR
program

Columns 13-16, number of overdetermined subspaces (simple
structure solutions) identified by the CLUSTR program

After  the  IEBGENER  Utility  routine  is  executed,  one  is  ready  for  the 
ELLIPSE  program. 
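
The header card layout can be illustrated with a short sketch that builds the card image. This Python fragment is an illustration only: the function name `header_card` is hypothetical, and the only facts taken from the text are the four right-justified fields in columns 1-4, 5-8, 9-12 and 13-16.

```python
def header_card(n_points, n_vars, n_clusters, n_subspaces):
    """Return an 80-column header card with four right-justified fields."""
    fields = (n_points, n_vars, n_clusters, n_subspaces)
    for value in fields:
        if not 0 <= value <= 9999:
            raise ValueError("each field must fit in four columns")
    # Each number occupies four columns, right justified.
    card = "".join(f"{value:>4d}" for value in fields)
    return card.ljust(80)
```

For the illustration of section 6.5 (45 points, 4 variables, 3 clusters, 3 overdetermined subspaces) the first sixteen columns would read `  45   4   3   3`.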

6.4  The  ELLIPSE  program 

This  program  works  under  the  control  of  GMS  (Graphics  Monitor 
System).  For  greater  detail  on  the  operation  of  GMS  see  Penn  [31].  In
order  to  be  operational  under  the  conversational  GMS,  the  load  module  of 
the  program  has  to  be  a  member  of  the  partitioned  data  set  GRAPHLIB.  The 
user  should  verify,  by  typing  the  command  $NAMES  on  the  console  typewriter, 
that  the  program,  in  fact,  is  a  member  of  the  GRAPHLIB  data  set.  If  not, 
the  user  will  first  have  to  compile  and  link  edit  the  program  using  the 
deck  set-up  given  in  section  5.4.  The  user  can  then  execute  the  program 
using  the  command  $LINK  ELLIPSE  to  link  to  it. 

Photographs made during the use of the program are reproduced here.
The user will find it helpful to refer to them while studying the rest of
the section. Some of the photographs will be specifically discussed in
the next section.

[Display 1: opening message]

THIS PROGRAM DISPLAYS CLUSTERS BY PROJECTING THEM ON
VARIOUS 2-DIMENSIONAL SUBSPACES. SIMPLE STRUCTURE
SOLUTIONS CAN ALSO BE INDICATED. REFER TO YOUR
INSTRUCTION CHART. HAVE YOU ENTERED ALL NECESSARY
DATA VIA IEBGENER?
THE BOTTOM ROW OF THE PROGRAM
FUNCTION KEYS IS LIT UP. THEY WILL BE USED
AS DIRECTED. THE DARK ONE, KEY NO. 30, IS
THE PANIC BUTTON. IT WILL RETURN YOU TO THE
MONITOR.
IN CASE OF PANIC PRESS KEY 30
NOW PRESS ANY KEY TO GET STARTED

[Display 2: the three vectors suggested by the rotation program, followed
by the request ENTER A TRANSFORMATION VECTOR, the instruction DEPRESS JUMP
KEY AFTER EACH COORDINATE, the entry fields X(1)=, X(2)=, ..., and the
reminder IF PREVIOUS VECTOR OK, JUST PRESS EOB]

[Display 3: the same display after the user has typed the suggested
vector 3: X(1)=-0.306, X(2)=-0.259, X(3)=-0.316, X(4)=0.860]

[Display 4: request for the axis variable]

PRESS ONE KEY CORRESPONDING TO THE AXIS
VARIABLE YOU WISH TO SELECT, OR 30 IF
YOU WISH TO STOP. IF YOU WISH TO MAKE
CHANGES IN YOUR VECTOR PRESS KEY 28.

The execution of ELLIPSE begins with the display of an informative
message. The user should carefully read the message. (Note especially
the use of the programmed function key 30. This is to be pressed only
when it is desired to stop the execution of the program.) The user should
then press any key other than 0 or 30. He is now asked to type the
coordinates of a vector which he wishes to utilize in order to form a
2-dimensional plane, the other axis being an observable variable. Notice
that vectors which transform the original data points into overdetermined
subspaces appear in this display for the user's ready reference. The user
should try one or more of these vectors, but he may also supply any number
of other vectors if he feels that this would help him interpret the
structure of his data. The coordinates of the vector intended to be used
should be typed in through the alphameric keyboard. The place where the
digit (or any character for that matter) typed will appear on the screen
is indicated by a cursor. The use of the "JUMP" key after one coordinate
is entered will cause the cursor to move over to the place for the next
coordinate. After all the coordinates of a vector are entered, the user
should press EOB. This is done by pressing both the 'ALT' and the '5' key
of the alphameric keyboard. The first time the user is asked to supply a
vector, he must give a non-null vector. (Later on, null vectors will be
acceptable and simply mean that there is no change. At that time the user
would just press EOB when this display appears and thus indicate that he
does not wish to change the previous vector.)

After the vector is supplied, the user is asked to choose a
variable as the other axis of a 2-dimensional plane. The choice is made by
pressing the programmed function key corresponding to the number of the
variable; if variable number 1 is desired, press key 1, etc.

After a vector and a variable have thus been chosen, the desired
2-dimensional plane will appear on the screen. The abscissa corresponds
to the variable, the ordinate corresponds to the chosen vector. The legend
(numbers given alongside the coordinate axes) indicates minimum and
maximum values. The serial number of each data point is projected at the
appropriate coordinates. Ellipses, i.e., projections of unit ellipsoids,
based upon the within-cluster metric of all points, are drawn around
each cluster center. These ellipses generalize the concept of a standard
unit interval in one dimension.

One aspect to be observed on the screen at this time is whether
the points which supposedly belong to the same cluster are close together
(regardless of which axes or vectors are used), and whether one could
distinguish them visually from points belonging to a different cluster.
There may be some overlaps but the clusters should be visually distinct.
If certain points exhibit the above phenomenon in all displays, then they
can be regarded as forming a cluster.

The user now has a choice. He may wish to superimpose one of the
overdetermined subspaces, or he may wish to change the vector, or the
variable, or both. To change the vector, he presses key 29; to change
the number of the variable, he presses key 31; and to superimpose an
overdetermined subspace, he presses the key corresponding to the number of
that subspace.

Superimposition of an overdetermined subspace results in the
display of projections of unit ellipsoids having metric based on only
those points which belong to the overdetermined subspace. If the plane
of projection selected is orthogonal to the overdetermined subspace, the
projections of unit ellipsoids under reference will be flat and elongated.
Further, the projections of the points lying on the subspace will make a
narrow band (almost resembling a straight line). The plane of projection
would be orthogonal to the overdetermined subspace if the vector which
transforms the original data points into this overdetermined subspace is
supplied as the desired vector. As many different planes as there are
dimensions can be constructed by taking this normal vector against each
of the variable axes.

The plane of projection, in a way, is the direction from which we
look at the ellipsoids embedded in the p-dimensional space. The ellipsoids
having metric based on the points belonging to an overdetermined
subspace must necessarily be flat and elongated. However, they must be
viewed from the proper direction. Otherwise, this elongation may not
appear or, in some instances, they will appear as very small circles.
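
The geometric claim above can be checked numerically. The following Python sketch is an illustration under stated assumptions, with hypothetical function names: it places points exactly on a hyperplane with normal `normal`, computes their screen coordinates in the plane formed by that normal and one variable axis (the normal's component along the axis is dropped and the remainder normalized, as assumed for the display), and measures how far the projected points depart from a single straight line.

```python
import math
import random

def points_on_plane(normal, offset, count, seed=0):
    """Random points p satisfying normal . p == offset."""
    rng = random.Random(seed)
    nn = sum(n * n for n in normal)
    points = []
    for _ in range(count):
        p = [rng.gauss(0.0, 1.0) for _ in normal]
        # move p along the normal until it lies on the plane
        shift = (offset - sum(n * v for n, v in zip(normal, p))) / nn
        points.append([v + shift * n for v, n in zip(p, normal)])
    return points

def screen_coords(point, normal, variable):
    """(x, y) of a point: x along the variable axis, y along the vector."""
    k = variable - 1
    rest = [0.0 if i == k else n for i, n in enumerate(normal)]
    length = math.sqrt(sum(r * r for r in rest))
    x = point[k]
    y = sum(p * r for p, r in zip(point, rest)) / length
    return x, y

def max_deviation_from_line(normal, offset, variable, count=20):
    """Largest violation of n_k * x + |rest| * y = offset on the screen."""
    k = variable - 1
    rest_len = math.sqrt(sum(n * n for i, n in enumerate(normal) if i != k))
    worst = 0.0
    for p in points_on_plane(normal, offset, count):
        x, y = screen_coords(p, normal, variable)
        worst = max(worst, abs(normal[k] * x + rest_len * y - offset))
    return worst
```

Points lying exactly on the subspace therefore fall on one straight line in every such display; real data, which lie only approximately on the subspace, form the narrow band described in the text.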

Once  again,  the  user  can  press  key  29  to  supply  a  new  vector,  key 
31  to  change  the  number  of  the  variable  and  the  key  corresponding  to  the 
number  of  an  overdetermined  subspace  to  superimpose  the  subspace.  The 
program  continues  in  this  manner.  It  can  be  terminated  at  any  time  by 
pressing  key  30. 

It should be noted that the projection onto a plane formed by any
pair of observable variables is a special case. If the user desires to
have projections onto the plane formed by variables 1 and 2, say, all he
has to do is supply (1.0, 0.0, ..., 0.0) as the vector and press key 2.
In fact, since the program does not necessarily require normalised vectors
as input, any vector of the form (a, 0, 0, ..., 0) where a ≠ 0 results
in the selection of variable 1 as one of the axes.

6.5  An  Illustration

The programs described above were used in the analysis of artificial
data. The data consisted of 45 points and 4 variables. These data were
generated in the following manner. First, 180 normal deviates with zero
mean and unit variance were generated; they made up the 180 elements of a
45 x 4 matrix, numbered column-wise. The first 25 measurements on the 4th
variable were then redefined by the relation

z(I) = z(I)/10 + (z(I - 135) + z(I - 90) + z(I - 45))/3
I = 136, 137, . . . , 160

i.e., the first 25 observations on variable 4 were replaced by one tenth
of the original observation plus the mean of the first three variables.
Similarly, the last 25 observations on variable 3 were replaced by the
relation

z(I) = z(I)/10 + (z(I - 90) + z(I - 45) + z(I + 45))/3
I = 111, 112, . . . , 135
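
The construction of the artificial data can be sketched in Python as follows. This is an illustration only: the original random number generator and seed are not known, so `random.gauss` and the `seed` argument are stand-ins, and the function names are hypothetical. The sketch assumes that the two relations are applied in the order given, and it also adds the mean shifts (3, 9, 5, 9), (5, 9, 7, 3) and (7, 3, 7, 5) for the three groups of 15 points described in the paragraph that follows.

```python
import random

N_POINTS, N_VARS = 45, 4
SHIFTS = ((3, 9, 5, 9), (5, 9, 7, 3), (7, 3, 7, 5))

def redefine(z):
    """Apply the two relations in place; z is indexed 1..180 (z[0] unused)."""
    # first 25 observations on variable 4
    for i in range(136, 161):
        z[i] = z[i] / 10 + (z[i - 135] + z[i - 90] + z[i - 45]) / 3
    # last 25 observations on variable 3
    for i in range(111, 136):
        z[i] = z[i] / 10 + (z[i - 90] + z[i - 45] + z[i + 45]) / 3
    return z

def make_data(seed=0):
    """Return 45 rows of 4 values with built-in subspaces and mean shifts."""
    rng = random.Random(seed)
    z = [0.0] + [rng.gauss(0.0, 1.0) for _ in range(N_POINTS * N_VARS)]
    redefine(z)
    rows = []
    for point in range(1, N_POINTS + 1):
        shift = SHIFTS[(point - 1) // 15]
        # column-wise numbering: element (point, var) is z[(var-1)*45 + point]
        rows.append([z[(var - 1) * N_POINTS + point] + shift[var - 1]
                     for var in range(1, N_VARS + 1)])
    return rows
```

Because the deviates differ from those used originally, the cluster and subspace assignments obtained from such data need not match the results reported below point for point.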

The modification of the data matrix by these two operations built in 2
subspaces. In effect, the first 25 observations on variable 4 became
almost a linear combination of the remaining 3 variables and the last 25
observations on variable 3 similarly became almost a linear combination
of the remaining 3 variables. Still, however, all the variables had zero
mean, and to introduce mean shifts, the vectors (3, 9, 5, 9), (5, 9, 7, 3)
and (7, 3, 7, 5) were added to the first 15 observations, second 15
observations and the last 15 observations respectively. Thus the entire
data matrix became a simulated sample drawn from normal populations with
mean vectors (3, 9, 5, 9), (5, 9, 7, 3) and (7, 3, 7, 5) and two built-in
subspaces. This data matrix was then subjected to the first program of
cluster identification and subspace determination. This program correctly
assigned the first 15 points to one cluster, the second 15 points to
another cluster and the last 15 points to a third cluster. The subspace
identification part of the program gave 3 overdetermined subspaces. This
was not at all surprising since the two planes were already built in and,
as frequently happens in such instances (see section 4.2), a plane was
found somewhat in between the two constructed planes. The cosines between
the normals to these three planes were as follows:

The planes identified were as under (serial numbers of points belonging to
the planes are given).


Plane 1: 12, 21, 22, 23, 24, 25, 27, 28, 29, 30, 31, 32, 33, 34, 35, 37,
38, 39, 40, 41, 42, 43, 44, 45 (24 points)

Plane 2: 2, 9, 14, 15, 17, 18, 29, 30, 32, 34, 37, 39 (12 points)

Plane 3: 1, 2, 4, 7, 8, 9, 10, 11, 13, 14, 15, 16, 17, 18, 19, 20, 21,
22, 24, 25 (20 points)

According to the built-in subspaces, one subspace should have contained the
last 25 points, i.e., points 21-45, and the other subspace should have
contained the first 25 points. Plane 1 given above did pick up 23 of the
last 25 points and an extra point, number 12. Likewise, plane 3 picked up
20 of the first 25 points. Plane 2 picked up 6 of the points belonging to
plane 1 and 6 of the points belonging to plane 3. Thus plane 2 is somewhat
of a mixture of the planes 1 and 3.
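
The counts quoted above can be verified directly from the three lists of serial numbers. The short Python check below is an illustration only; point 14 is included in plane 3, as implied by the stated count of 20 points.

```python
PLANE_1 = {12, 21, 22, 23, 24, 25, 27, 28, 29, 30, 31, 32, 33, 34, 35, 37,
           38, 39, 40, 41, 42, 43, 44, 45}
PLANE_2 = {2, 9, 14, 15, 17, 18, 29, 30, 32, 34, 37, 39}
PLANE_3 = {1, 2, 4, 7, 8, 9, 10, 11, 13, 14, 15, 16, 17, 18, 19, 20, 21,
           22, 24, 25}

FIRST_25 = set(range(1, 26))   # points of the subspace built into variable 4
LAST_25 = set(range(21, 46))   # points of the subspace built into variable 3

def overlap(a, b):
    """Number of serial numbers common to two sets of points."""
    return len(a & b)

summary = {
    "plane 1 in last 25": overlap(PLANE_1, LAST_25),
    "plane 3 in first 25": overlap(PLANE_3, FIRST_25),
    "plane 2 shared with plane 1": overlap(PLANE_2, PLANE_1),
    "plane 2 shared with plane 3": overlap(PLANE_2, PLANE_3),
}
```

The four entries of `summary` correspond, in order, to the four counts stated in the paragraph above.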

The  results  of  the  CLUSTR  program  were  then  displayed  using  the 
ELLIPSE  program.  Notice  especially  the  following  displays  in  which  the 
2-dimensional  planes  were  formed  by  relecting  a  suggested  vector  (-0.306, 
-0.259,  -0.316,  0.860)  and  an  observable  variable.  The  vector  under  con¬ 
sideration  is  the  vector  3,  which  transformed  the  original  data  points  into 
simple  structure  plane  3. 
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(a)  Variable  1  against  vector  3 — After  normalizing,  the  vector  reduced  to 

(0.0,  -0.272,  -0.332,  0.903).  Plane  3  was  superimposed.  Points  number 

1,  2,  3,  4,  5,  7,  8,  9,  10,  11,  12,  13,  14,  15,  16,  17,  IB,  19,  20,  21, 

• 

22,  23,  24,  25,  26  can  be  3een  to  be  lying  on  the  simple  structure  plane. 


VAtU&lE  UQAIHST 
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(b)  Variable  2  against  vector  3 — After  normalizing,  the  vector  reduced  to 

9  *  /  f\  *»  f\  n  r>  *\nn  r\  ni  .  .  ,  *>  ,  <  «  •  »  ’ 

\  y  •  4  i.uuv  v  wtto  bupciJLAuajiUocu*  i  biuto  uuwuci 

1,  3,  5,  6,  7,  8,  9,  10,  11,  12,  13,  14,  15,  16,  17,  18,  19,  20,  21,  22,  23, 
24,  25,  26  can  be  seen  lying  on  the  simple  structure  plane. 
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(c)  Variable  3  against  vector  3 — After  normalizing,  the  vector  reduced  to 
(-0.322, -0.273, 0.0, 0.906). Plane 3 was superimposed. Points number
1-26 can be seen lying on the simple structure plane.
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(d)  Variable  4  against  vector  3 — The  vector,  after  normalizing,  reduced  to 
(-0.599, -0.507, -0.619, 0.0). Plane 3 was superimposed. Points number
1-27, with the exception of number 5, can be seen lying on the simple
structure plane.
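
The reduced vectors quoted in the list above are consistent with one simple
reading: the component of vector 3 that belongs to the displayed variable is
set to zero and the remainder is rescaled to unit length. The source does not
state how ELLIPSE computes this, so the Python sketch below is an assumption
checked only against the printed figures, and the function name
`drop_and_normalize` is hypothetical.

```python
import math

# Vector 3 as quoted in the text (normal to simple structure plane 3).
VECTOR_3 = (-0.306, -0.259, -0.316, 0.860)


def drop_and_normalize(vector, variable):
    """Zero the component for `variable` (1-based) and rescale to unit length.

    Assumed reading of "after normalizing, the vector reduced to ...";
    not taken from the ELLIPSE source.
    """
    reduced = [0.0 if i == variable - 1 else c for i, c in enumerate(vector)]
    norm = math.sqrt(sum(c * c for c in reduced))
    return [c / norm for c in reduced]


for variable in range(1, 5):
    reduced = drop_and_normalize(VECTOR_3, variable)
    print(variable, [round(c, 3) for c in reduced])
```

Under this reading, the results for variables 1, 3 and 4 agree with the
legible vectors printed in the list above to within rounding.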


The above 4 displays pertain to each of the 4 variables against
vector 3, with simple structure plane 3 superimposed. Vector 3 is the
vector which transforms the raw data into simple structure plane 3. Points
1-25, with the exception of 3, 5, 6, 12 and 23, lie on this plane. Vector 3
is the normal to simple structure plane 3. Hence any plane passing
through vector 3 is orthogonal to simple structure plane 3. When
projections onto this orthogonal plane are taken, the points lying on the
simple structure plane should fall within a narrow band (almost resembling
a straight line), and the above 4 displays clearly bring out this fact.
Variables 1, 2, 3, and 4 define 4 different planes, respectively, each passing
through vector 3 and all orthogonal to simple structure plane 3.
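
The narrow band follows from a short calculation. Suppose a point x lies on
the hyperplane n.x = c, and the display axes are the variable axis x_k and the
unit vector r obtained from n by zeroing component k and dividing by the
remaining length s. Then n_k*x_k + s*(r.x) = c, which is a straight line in
the display. The Python sketch below checks this on synthetic points. The
offset `c`, the random points and all function names are assumptions made for
illustration; none of them come from CLUSTR or ELLIPSE.

```python
import math
import random

NORMAL = (-0.306, -0.259, -0.316, 0.860)  # vector 3 as quoted in the text


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def reduced_axis(normal, k):
    """Zero component k (0-based); return the unit vector and the length s removed."""
    r = [0.0 if i == k else c for i, c in enumerate(normal)]
    s = math.sqrt(dot(r, r))
    return [c / s for c in r], s


def project_onto_hyperplane(x, normal, c):
    """Move x along the normal until it satisfies normal . x = c."""
    t = (dot(normal, x) - c) / dot(normal, normal)
    return [xi - t * ni for xi, ni in zip(x, normal)]


def display_coords(x, normal, k):
    """Coordinates of x in the display spanned by variable k and the reduced vector."""
    r, _ = reduced_axis(normal, k)
    return x[k], dot(r, x)


def line_residual(x, normal, k, c):
    """Residual of the line equation n_k*u + s*v = c at the display point of x."""
    _, s = reduced_axis(normal, k)
    u, v = display_coords(x, normal, k)
    return normal[k] * u + s * v - c


random.seed(0)
c = 1.0  # hypothetical plane offset
points = [
    project_onto_hyperplane([random.uniform(-5, 5) for _ in range(4)], NORMAL, c)
    for _ in range(20)
]
worst = max(abs(line_residual(p, NORMAL, k, c)) for p in points for k in range(4))
print(f"largest residual over all four displays: {worst:.2e}")
```

Because the residual equals n.x - c, it vanishes (up to floating-point error)
for every point on the plane in each of the four displays, and is nonzero for
any point off the plane.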

This can also be thought of as rotating a plane passing through vector
3 around vector 3. Projections are taken when this rotating plane
passes through the axis corresponding to each of the variables. Attention
should also be drawn to the following displays.
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(e)  Variable  1  against  vector  3 — The  vector,  after  normalising,  reduced 
to  (0.0,  -0.272,  -0.332,  0.903).  Simple  structure  plane  1  was  superimposed. 
Since  the  plane  passing  through  vector  3  and  the  axis  corresponding  to 
variable 1 is not orthogonal to simple structure plane 1, we do not see
points  lying  on  a  narrow  band  resembling  a  straight  line  in  this  display. 
(f) Variable 2 against variable 1 — Plane 3 was superimposed. Once again,
for  the  reasons  mentioned  in  (e)  above,  we  do  not  see  a  good  simple 
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from  the  proper  position. 
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(g)  Variable  4  against  vector  1 — The  vector,  after  normalizing,  reduced 
to  (-0.325,  -0.289,  0.900,  0.0).  Plane  1  was  superimposed.  Since  any 
plane  passing  through  vector  1  is  orthogonal  to  simple  structure  plane  1, 
we expect to see a good simple structure in this display and we do.
Points number 21, 22, 23, 25, 26, 28, 30, 31, 32, 33, 34, 35, 37, 38, 39,
40, 41, 42, 43, 44, 45 lie within a narrow band resembling a straight
line.
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The  user  should  find  the  other  displays  easy  to  understand.  A 
good  simple  structure  is  seen  only  when  a  plane  is  selected  which  passes 
through  a  vector  that  transforms  the  raw  data  into  a  simple  structure 
solution,  and  the  corresponding  simple  structure  plane  is  superimposed. 

It should, however, be noted that the clustering of points is not affected
by  this  principle  and  hence,  in  all  displays,  the  clusters  can  be  easily 
identified. 
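
The contrast between a matched and a mismatched vector can be made concrete by
measuring the width of the band. In a display through a vector, the
perpendicular distance of a point from the superimposed line equals the
point's orthogonal distance from the hyperplane normal to that vector, so the
band width is the spread of the points along the unit vector. The Python
sketch below is only an illustration on synthetic data: `OTHER_NORMAL`, the
plane offsets, the random points and the function names are all hypothetical
and are not values or routines from the report.

```python
import math
import random

VECTOR_3 = (-0.306, -0.259, -0.316, 0.860)  # quoted in the text
OTHER_NORMAL = (0.5, -0.5, 0.5, 0.5)        # hypothetical, NOT the report's vector 1


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def points_on_plane(normal, offset, count, rng):
    """Random 4-D points moved along `normal` so that normal . x = offset."""
    pts = []
    for _ in range(count):
        x = [rng.uniform(-5.0, 5.0) for _ in normal]
        t = (dot(normal, x) - offset) / dot(normal, normal)
        pts.append([xi - t * ni for xi, ni in zip(x, normal)])
    return pts


def band_width(points, view_vector):
    """Spread of the points along the unit view vector."""
    length = math.sqrt(dot(view_vector, view_vector))
    heights = [dot(view_vector, p) / length for p in points]
    return max(heights) - min(heights)


rng = random.Random(1)
on_plane_3 = points_on_plane(VECTOR_3, 1.0, 25, rng)
on_other = points_on_plane(OTHER_NORMAL, 1.0, 25, rng)

matched = band_width(on_plane_3, VECTOR_3)     # plane 3 viewed through vector 3
mismatched = band_width(on_other, VECTOR_3)    # another plane viewed through vector 3
print(f"matched band width:    {matched:.2e}")
print(f"mismatched band width: {mismatched:.2f}")
```

In this sketch the matched band collapses to essentially zero width, while the
mismatched band stays wide, which mirrors the difference between displays such
as (a) and (e).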
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APPENDIX  A 


Source Listings for CLUSTR Program
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